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The  maintenance  of  software  systems  usually  begins  with  considerable  effort 
spent  in  understanding  their  system  structures.  A  system  modularization  defines 
a  structure  of  the  system  in  terms  of  the  grouping  of  routines  into  modules  within 
the  system.  This  dissertation  presents  an  approach  to  obtain  a  modularization  of  a 
program  based  on  the  object-like  features  found  in  the  program. 

While  object-oriented  methodologies  for  software  design  and  development  have 
only  been  clearly  enunciated  in  the  last  few  years,  many  object-like  features  such 
as  data  grouping,  abstract  data  types,  and  inheritance  have  been  in  use  for  some 
time.  In  this  dissertation,  methodologies  which  aid  in  the  recovery  of  the  object- 
like features  of  a  program  written  in  a  non  object  oriented  programming  language 
are  explained.  Two  complementary  methods  for  "object"  identification  are  proposed 
which  focus  on  data  bindings  and  on  type  bindings  in  a  program.  The  proposed 
approach  looks  for  clusters  of  data,  structures,  and  routines,  that  are  analogous  to 
the  objects  and  object  classes  of  object-oriented  programming.  The  object  finder  is 
an  interactive  tool  that  combines  the  two  methods  while  using  human  input  to  guide 
the  object  identification  process.  The  experience  of  using  the  object  finder  and  two 
evaluation  studies  of  the  object  identification  methods  are  also  presented. 
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CHAPTER  1 
INTRODUCTION 

The  maintenance  of  software  systems  usually  begins  with  considerable  effort 
spent  in  understanding  their  system  structures  and  data.  A  system  modularization 
defines  a  structure  of  the  system  in  terms  of  the  grouping  of  routines  into  modules 
within  the  system.  This  dissertation  presents  an  approach  to  obtain  a  modularization 
of  a  program  based  on  the  object-like  features  found  in  the  program.  In  modifying 
existing  software,  professional  maintainers  are  almost  unanimous  in  identifying  the 
understanding  of  system  data  as  one  of  their  greatest  challenges.  Successful  main- 
tenance requires  precise  knowledge  of  the  data  items  in  the  system,  the  ways  these 
items  are  created  and  modified,  and  their  relationships.  Changing  a  program  without 
a  clear  vision  of  its  implicit  data  model  is  a  very  risky  undertaking. 

There  seems  to  be  little  work  that  has  explicitly  addressed  the  problem  of  data 
understanding  during  software  maintenance.  A  number  of  methodologies  attempt 
to  aid  human  program  understanding  of  program  constructs  by  cross  referencing,  by 
capturing  dependencies  [9,  46],  by  program  slicing  [13,  43],  or  by  the  ripple  effect 
analysis  of  changes  [50].  Other  tools  that  have  been  proposed  so  far,  such  as  those 
described  by  Ambras  and  O'Day  [1],  Biggerstaff  [4],  Kaiser  et  al.  [16],  Rich  and  Wa- 
ters [32],  and  Yau  and  Liu  [49],  use  knowledge-based  approaches  to  provide  inference 
capabilities.  As  a  result,  a  user  can  derive  additional  information  that  may  not  be 
explicit  in  the  program  code. 

Technologists  in  software  design  have  made  great  progress  in  abstracting  the 
ways  computer  programs  use  data.  Most  notable,  perhaps,  has  been  the  emergence 
of  the  concept  of  object-oriented  design  and  development.  Booch  defines  an  object  as 


"an  entity  whose  behavior  is  characterized  by  the  actions  it  suffers  and  that  it  requires 
of  other  objects"  [5].  In  practice,  most  objects  are  collections  of  data,  together  with 
the  methods  needed  to  access  and  manipulate  those  data. 

Although  object-oriented  programming  constructs  are  not  directly  supported 
in  conventional  programming  languages  such  as  "C"  and  "Ada,"  several  object-like 
features,  such  as  groupings  of  related  data,  abstract  data  types,  and  inheritance,  have 
been  in  use  for  some  time  and  may  occur  in  an  existing  program.  If  such  software 
needs  to  be  maintained,  it  would  be  highly  advantageous  to  identify  the  object-like 
features  in  the  system.  Knowledge  of  such  "objects"  would  be  important  to: 

1.  Understand  the  system's  design  more  precisely. 

2.  Facilitate  reuse  of  existing  "methods"  contained  in  the  system. 

3.  Avoid  degrading  the  existing  design  by  introducing  unnecessary  references  to 
data  that  should  be  private  to  a  given  class  of  "objects." 

4.  Reengineer  the  system  from  a  conventional  programming  language  (such  as 
"C")  into  an  object-oriented  language  (such  as  "C++")  to  facilitate  future 
enhancements. 

We  identified  two  important  factors  necessary  for  characterizing  objects:  global 
or  persistent  data  [18]  and  the  types  of  formal  parameters  and  return  values.  Each 
factor  will  in  turn  originate  a  new  algorithm  for  object  identification.  This  disser- 
tation presents  these  two  algorithms  for  identifying  object-like  features  in  existing 
source  code.  One  focuses  on  the  data  bindings  between  program  components  and  the 
other  on  type  bindings  between  program  components.  The  two  algorithms,  as  well 
as  their  implementations,  are  collectively  known  as  the  object  finder. 


The  Globals-based  Object  Finder  algorithm  uses  the  information  provided  by 
persistent  data  structures  to  identify  the  objects  in  a  program.  Data  bindings  be- 
tween system  components  have  been  previously  used  as  the  basis  for  clustering.  In 
hierarchical  clustering  [14],  for  example,  the  elements  chosen  for  grouping  are  the 
ones  with  the  smallest  "dissimilarity,"  that  is  to  say,  with  the  highest  number  of 
data  bindings  between  elements.  Our  approach  for  globals-based  object  identifica- 
tion provides  similar  capabilities  by  grouping  those  routines  which  access  a  common 
global  variable  into  "highly  connected"  objects.  This  algorithm  handles  procedural 
programming  languages,  such  as  "Ada,"  "C,"  "COBOL,"  "Pascal,"  or  "FORTRAN," 
that  provide  scoping  mechanisms  that  allow  the  definition  of  global  variables  and  side 
effects  on  those  global  variables. 

The  Types-based  Object  Finder  algorithm  uses  type  binding  as  the  basis  for 
grouping  routines  into  objects.  This  algorithm  groups  the  routines  according  to  the 
types  of  data  used  for  return  value  and  formal  parameters  of  routines.  Types-based 
object  identification  considers  the  "semantic"  information  provided  by  the  types  for 
the  clustering.  This  is  different  from  other  clustering  techniques  based  on  seman- 
tic information  such  as  conceptual  clustering  [35]  which  considers  "light  semantic" 
profiles  about  a  system  (from  detailed  cross  reference  information)  during  clustering. 
In  the  latter,  a  concept  tree  represents  the  common  features  of  the  members  of  a 
group,  such  as  the  names  that  a  software  unit  uses  ("names-used")  and  the  names  of 
the  places  where  it  is  used  ("user-names").  This  algorithm  handles  those  procedural 
programming  languages  which  provide  explicit  type  construction  mechanisms. 

This  dissertation  also  presents  the  experience  of  using  the  object  finder  algo- 
rithms and  an  evaluation  of  the  object  finder  algorithms  through  careful  examination 
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of  the  results.  Two  studies  were  performed  to  evaluate  the  object  identification  al- 
gorithms. In  study  I,  the  evaluation  consisted  of  comparing  the  groups  (identified 
objects)  identified  using  the  object  finder  with  those  groups  (clusters)  identified  us- 
ing hierarchical  clustering  [14].  Study  II  compared  the  identified  objects  found  in  a 
program  with  the  object-oriented  programming  classes  found  in  the  object-oriented 
version  of  the  program. 

The  comparison  in  study  I  was  based  on  the  complexity  of  the  two  partition- 
ings  resulting  from  each  clustering  technique.  The  measure  of  the  complexity  of 
the  partitionings  is  similar  to  coupling  and  cohesion  [6].  Thus,  we  defined  a  new 
set  of  complexity  metrics  factors  called  inter-group  complexity,  which  measures  the 
complexity  of  the  interface  of  a  group,  and  intra-group  complexity,  which  measures 
the  internal  complexity  of  a  group,  to  measure  the  complexity  of  a  given  group  in 
a  partitioning.  For  this  evaluation,  we  instrumented  the  program  with  access  pairs 
and  data  binding  triples  in  order  to  measure  those  metrics.  These  newly  defined 
complexity  metrics  factors  were  validated  and  the  results  are  reported  in  this  work. 
The  comparison  results  are  also  reported  in  this  dissertation. 

In  study  II,  a  "C"  program,  translated  from  a  "C++"  program,  is  used  to 
compare  the  objects  identified  in  the  "C"  program  with  the  classes  found  in  the 
"C++"  version  of  the  program.  This  example  also  illustrates  some  of  the  design 
recovery  capabilities  of  the  object  finder  when  abstracting  the  underlying  structure 
of  a  system. 

The  remainder  of  the  dissertation  is  organized  as  follows:  Chapter  2  provides  a 
brief  overview  of  design  recovery  and  complexity  metrics.  Chapter  3  introduces  the 
new  approach,  by  way  of  several  examples,  for  modularizing  an  existing  program  by 
identifying  the  object-like  features  found  in  the  program.    Chapter  4  discusses  the 


time  and  space  complexity  of  the  new  approach.  Chapter  5  presents  the  evaluation 
results  of  the  proposed  approach,  using  complexity  metrics,  and  discusses  the  new  set 
of  complexity  metrics  factors  and  their  validation.  Chapter  6  discusses  an  application 
of  the  proposed  approach  in  software  systems  design  recovery  and  introduces  a  new 
approach  for  design  recovery  either  top-down  or  bottom-up.  Chapter  7  outlines  the 
implementation  of  a  prototype  designed  to  demonstrate  the  flexibility  and  portability 
of  the  design  of  the  prototype.  Chapter  8  discusses  the  experience  of  using  the  ap- 
proach to  modularize  and  recover  the  design  information  from  several  programs.  The 
most  important  limitation  of  the  proposed  approach  is  that  it  is  currently  restricted 
to  two  criteria  of  object  identification.  Many  other  criteria  are  suggested  for  the 
future  study  including  object-oriented  principles,  such  as  the  classification  of  the  ob- 
jects in  the  application,  organization  of  the  objects,  and  identification  of  operations 
on  the  objects.  More  experimentation  is  also  needed  to  further  evaluate  the  object 
identification  approach,  and  other  metrics  factors  should  be  added  to  measure  the 
"object-orientedness"  of  the  recovered  design.  The  conclusions  and  the  future  study 
are  discussed  in  Chapter  9. 


CHAPTER  2 
BACKGROUND 


This  chapter  summarizes  some  relevant  background  information  on  design  re- 
covery and  complexity  metrics. 

2.1     Design  Recovery 

Design  recovery  is  defined  by  Biggerstaff  [4]  as  recreating  design  abstractions 
from  a  combination  of  code,  existing  design  documentation  (if  available),  personal 
experience,  and  general  knowledge  about  problem  and  application  domain.  Design 
recovery  must  generate  all  of  the  information  required  for  a  person  to  understand 
the  functionality  of  a  program,  how  it  accomplishes  its  responsibilities,  and  why  it 
performs  that  functionality. 

Design  recovery  is  common  and  critical  throughout  the  software  life  cycle.  The 
developer  of  new  software  usually  has  to  spend  a  great  deal  of  time  trying  to  un- 
derstand the  structure  of  similar  systems  and  system  components.  The  software 
maintainer  spends  most  of  his  or  her  time  studying  a  system's  structure  to  under- 
stand the  nature  and  effect  of  a  requested  change.  Without  fully  understanding  the 
similar  systems,  the  developer  may  not  harvest  the  reusable  components  and  reusable 
designs.  Without  fully  understanding  the  program  to  be  maintained,  the  maintainer 
may  conduct  inefficient  or  incorrect  program  modification. 

Without  automated  techniques,  design  recovery  is  difficult  and  time  consuming 
because  the  source  code  does  not  usually  contain  much  of  the  original  design  informa- 
tion, changes  are  not  adequately  documented,  there  is  no  change  stability,  and  there 
are  ripple  effects  when  making  changes.    Furthermore,  large  scale  software  worsen 


these  difficulties.  Thus,  automated  supports  for  the  understanding  process  are  very 
desirable. 

In  the  following  sections  we  survey  the  current  approaches  to  design  recovery. 

2.1.1     Clustering  Approaches 

The  main  characteristics  of  the  clustering  approach  are  that  it  shows  the  struc- 
tural analysis  of  large  software  systems  via  adequate  clustering  techniques  and  the 
retrieval  of  high  level  structural  information  from  the  existing  code  [23].  The  goal  is 
to  group  routines  into  modules  within  the  systems  to  reflect  the  modules  defined  by 
the  developer. 

Clustering  is  the  analysis  of  the  interface  between  the  components  of  a  system. 
It  helps  to  determine  the  modularization  that  those  interfaces  define.  The  modules 
defined  by  this  analysis  are  called  clusters. 

Clustering  techniques  with  data  bindings  have  been  well  studied,  but  the  tech- 
nique which  involves  the  type  bindings  approach  has  only  recently  been  published. 
Our  object  identification  approach  falls  under  the  latter  of  these  clustering  techniques. 
In  the  following  two  sections,  the  data  binding  approach  and  the  type  binding  ap- 
proach are  examined. 

2.1.1.1     Data  binding 

In  this  section,  we  explain  in  detail  the  clustering  techniques,  based  on  data 
bindings,  used  in  deriving  a  system's  clusters.  Later,  we  use  these  clusters  to  compare 
the  grouping  of  the  routines  derived  by  the  object  finder,  with  those  groups  derived 
by  these  clustering  techniques. 

Data  binding  [2,  41,  14]  reflects  the  interaction  among  the  components  of  a 
system;  it  has  been  previously  used  for  module  interaction  metrics  [2].  In  their 
work,  Hutchens  and  Basili  [14]  use  data  bindings  to  measure  the  interface  between 
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components  of  a  system  and  to  derive  system  clusters.  For  example,  assume  there 
are  two  procedures,  pi  and  p2,  and  a  variable  a;  in  a  program.  When  procedures 
Pi  and  p2  and  the  variable  x  are  in  the  same  static  scope,  whether  the  procedures 
access  the  variable  or  not,  this  is  called  a  potential  data  binding,  denoted  as  (pi,  x,p2) 
because  it  reflects  the  possibility  of  data  interaction  between  the  two  procedures.  If 
both  procedures  access  variable  x,  then  there  exists  a  used  data  binding.  More  work 
is  required  to  calculate  the  used  data  binding  than  the  potential  data  binding,  but 
the  former  reflects  a  similarity  between  px  and  p2.  If  procedure  pi  assigns  a  value  to 
variable  x  and  procedure  p2  references  x,  this  will  reflect  a  flow  of  information  from 
Pi  to  p2  via  the  variable  x.  This  is  called  actual  data  binding  and  it  is  difficult  to 
calculate  since  a  distinction  between  reference  and  assignment  must  be  maintained. 
Besides  these  bindings,  control  flow  data  binding  is  another  kind  of  data  binding,  but 
it  requires  considerably  more  computation  effort  than  actual  data  binding,  because 
static  data  flow  analysis  must  be  performed. 

After  the  data  binding  information  is  available,  the  system  components  can 
be  grouped  based  on  the  strength  of  their  relationships  with  each  other  [3].  This 
is  derived  by  some  specialized  clustering  techniques.  The  use  of  data  bindings  in 
clustering  analysis  provides  meaningful  pictures  of  high  level  system  interactions  and 
even  system  "fingerprints,"  which  are  descriptive  analogies  to  galactic  star  systems. 
Next,  we  summarize  Hutchens  and  Basili's  hierarchical  clustering  modularization 
approach  [14]  and  the  current  implementation  [21]  of  their  approach. 

The  nature  of  Hutchens  and  Basili's  clustering  algorithms  is  bottom-up  or  ag- 
glomerative  since  they  iteratively  create  larger  and  larger  groups,  until  the  elements 
have  coalesced  into  a  single  cluster.  Thus,  this  is  a  module  composition  technique. 
The  elements  chosen  for  grouping  are  the  ones  with  the  smallest  dissimilarity.  There 


are  several  methods  for  computing  the  dissimilarity  [14];  the  clustering  algorithms 
used  by  Hutchens  and  Basili  correspond  to  the  "single-link"  algorithm  [14]  which 
takes  the  smallest  dissimilarity  between  the  elements  of  each  pair  of  newly  formed 
clusters  as  the  new  coefficient  between  them. 

The  first  step  in  hierarchical  clustering  is  to  abstract  the  data  to  obtain  a  binding 
matrix  which  represents  the  number  of  data  bindings  between  any  two  components 
of  the  system.  This  matrix  is  a  symmetric  matrix.  The  current  implementation 
considers  all  levels  of  data  binding  up  to  actual  data  bindings. 

The  second  step  is  to  obtain  a  dissimilarity  matrix  using  one  of  two  alternative 
methods: 

1.  Recomputed  Bindings:  based  on  the  percentage  of  the  bindings  that  connect  to 
either  of  two  components  of  the  system.  The  dissimilarity  matrix  p  is  defined 
by  d(i,j)  —  p(i,j)  =  (sumi  +  sumj  —  2b(i,j))/(sumi  +  sumj  —  b(i,j))  where 
sumi  is  the  number  of  data  bindings  in  which  component  i  occurs  and  sumj 
is  the  number  of  data  bindings  in  which  component  j  occurs.  In  this  case, 
p(i,j)  is  the  probability  that  "a  random  data  binding  chosen  from  the  union 
of  all  bindings  associated  with  i  or  j  is  not  in  the  intersection  of  all  bindings 
associated  with  i  or  j"  [14]. 

2.  Expected  Bindings:  a  weight  is  assigned  to  each  binding  level  relative  to  the  total 
number  of  elements  under  consideration  in  a  given  iteration.  The  dissimilarity 
matrix  p  is  defined  by  d(i,j)  =  (k/(n  —  l))/bind(i,j)  where  n  is  the  number  of 
elements  under  consideration  and  k  is  the  number  of  bindings  involving  either 
element  i  or  element  j.  Thus,  one  would  expect  kf(n  —  1)  of  the  bindings  to 
be  between  i  and  j. 
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The  current  implementation  of  this  clustering  technique  uses  expected  bindings  to 
compute  the  dissimilarity  matrix. 

During  the  third  step,  the  new  clusters  are  formed  by  grouping  together  those 
elements  whose  dissimilarity  is  the  smallest.  Then,  a  new  binding  matrix  is  obtained 
from  the  clusters  and  the  process  starts  again  from  this  new  binding  matrix. 

Hutchens  and  Basili  summarize  the  problem  of  using  data  bindings  only  for 
modularization  as  that  "whenever  a  module  that  defines  an  abstract  data  type  and 
has  no  local  data  that  is  shared  among  the  operations  on  the  type,  [the  module]  will 
not  be  located  using  this  method"  [14].  They  explain  that  this  is  due  to  the  fact  that 
there  is  no  direct  data  binding  between  the  operations  of  the  module  and  that  all  the 
interactions  are  indirect  through  the  procedures  that  use  the  abstraction. 

The  algorithms  for  object  identification  in  this  dissertation  present  a  solution 
to  this  problem  that  consists  of  using  data  types  and  establishing  "relationships" 
between  such  types  and  the  routines  that  use  them  for  formal  parameters  or  return 
values.  Then,  the  abstract  data  type  is  revealed  from  the  hiding  imposed  by  the 
abstraction  mechanisms. 

2.1.1.2     Type  binding 

Type  binding  methodology  [22]  is  the  most  recently  published  design  recovery 
technique.  It  analyzes  a  conventional  procedural  programs  in  the  context  of  an  object- 
oriented  paradigm. 

Due  to  the  gradual  software  paradigm  progression  from  a  purely  procedural 
approach  to  an  object-based  approach  and  now  to  the  object-oriented  approach, 
object-like  features,  such  as  data  grouping,  abstract  data  type,  and  inheritance,  al- 
ready exist  in  conventional  programming  languages  such  as  "C"  and  "Ada."  It  is 
very  likely  that  the  object-oriented  concepts  have  been  used  in  existing  programs  for 
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some  time.  Type  binding  methodology  consists  of  identifying  the  objects  in  the  con- 
ventional procedural  programs  in  order  to  recover  the  underlying  system  structures 
[22].  The  focus  of  this  dissertation,  the  object  finder,  is  a  methodology  that  combines 
both  the  data  binding  approach  and  the  type  binding  approach  in  modularizing  a 
software  system. 

The  object  finder  is  somewhat  analogous  to  other  software  clustering  methods, 
but  it  is  unique  in  searching  for  a  particular  kind  of  cluster  which  is  similar  to  an 
abstract  data  type  or  object  and  cannot  be  clustered  by  a  data  binding  methodology. 

2.1.2     Program  Slicing  Approach 

The  concept  of  program  slicing  is  originally  discussed  by  Mark  Weiser  [43]. 
Weiser  defines  a  slicing  criterion  as  a  pair  (p,  V),  where  p  is  a  program  point  and  V 
is  a  subset  of  the  program's  variables.  In  his  work,  a  slice  of  program  consists  of  all 
statements  and  predicates  of  the  program  that  might  affect  the  values  of  variables  in 
V  at  point  p.  Weiser's  work  has  been  improved  by  Horwitz  et  al.  [13],  on  the  problem 
of  interprocedural  slicing  —  generating  a  slice  of  an  entire  program,  where  the  slice 
crosses  the  boundaries  of  procedure  calls. 

The  program  slicing  technique  can  help  a  programmer  understand  complicated 
source  codes  by  isolating  individual  computation  threads  within  a  program.  It  can 
also  aid  in  debugging  and  be  used  for  automatic  parallelization.  Program  slicing  is 
also  used  for  automatically  integrating  program  variants;  in  this  case,  slices  are  used 
to  compute  a  safe  approximation  to  the  change  in  behavior  between  a  program  P  and 
a  modified  version  of  P  and  to  help  determine  whether  two  different  modifications  to 
P  interfere  [13]. 

The  main  drawback  of  program  slicing  in  recovering  high  level  information 
is  that  this  approach  is  oriented  towards  low  level  information  abstraction.    This 
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makes  it  difficult  to  understand  high  level  system  interaction  and  the  overall  system 
structure. 

2.1.3     Program  Dependence  Approach 

Program  dependencies  arise  as  the  result  of  two  separate  effects.  First,  a  de- 
pendence exists  between  two  statements  whenever  a  variable  appearing  in  one  of  the 
statements  may  have  an  incorrect  value  if  the  statements  were  reversed.  For  example, 
given 

A  =  B  *  C  —  SI 

D=A*E+1  —   S2 

S2  depends  on  SI,  since  executing  S2  before  SI  would  result  in  S2  using  an  incorrect 
value  for  A.  Dependencies  of  this  type  are  called  data  dependencies.  Second,  a 
dependence  exists  between  a  statement  and  the  predicate  whose  value  immediately 
controls  the  execution  of  that  statement.  In  the  sequence 


if (A)   then 

—   SI 

B  =  C  *  D 

—   S2 

endif 

S2  depends  on  predicate  A  since  the  value  of  A  determines  whether  S2  is  executed. 
Dependencies  of  this  type  are  called  control  dependencies. 

Program  dependence  analysis  can  be  used  in  program  slicing  and  in  identifying 
reusable  components  for  software  maintenance  [46].  Capturing  the  program  depen- 
dencies can  help  program  understanding  to  aid  modification.  On  the  other  hand, 
this  is  limited  to  specific  components  of  a  program  and  it  is  does  not  facilitate  high- 
level  understanding  of  a  program.  However,  one  significant  application  of  program 
dependence  knowledge  is  in  detection  of  parallelism  and  code  optimization  [9]. 
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2.1.4     Knowledge-based  Approach 

Knowledge-based  approaches  [49,  1,  16,  32,  4]  are  very  sophisticated  and  com- 
plicated general  methodologies  of  design  recovery.  Their  most  distinguishing  property 
as  far  as  program  understanding  is  concerned  is  that  instead  of  just  using  the  ex- 
isting code,  they  use  all  the  program  information,  including  the  source  code,  the 
documentation,  the  execution  histories,  program  analysis  results,  etc. 

The  program  information  is  expressed  as  patterns  of  program  structures,  prob- 
lem domain  structures,  language  structures,  naming  conventions,  and  so  forth.  It  is 
stored  in  a  central  knowledge  base  which  provide  frameworks  for  the  interpretation 
of  the  code. 

A  knowledge-based  system  is  not  a  single  tool,  such  as  an  editor  or  debugger. 
It  is  a  collection  of  tools  sharing  a  common  knowledge  base.  It  holds  the  promise  of 
providing  programmers  a  next- generation  programming  environment  with  the  goal 
of  dramatically  improving  the  quality  and  productivity  of  software  development  [1]. 
However,  knowledge-based  approaches  are  still  at  the  research  stage.  We  consider 
that  a  more  constrained  tool,  such  as  the  object  finder,  provides  immediate,  more 
accurate  knowledge  about  the  high-level  understanding  of  a  program,  directly  from 
its  source  code. 

2.2     Complexity  Metrics 

This  section  presents  the  complexity  metrics  considered  for  the  evaluation  of 
the  object  finder.  We  also  explain  the  rationale  for  developing  a  new  set  of  metrics 
to  be  used  in  this  evaluation. 

Software  complexity  is  defined  by  Ramamoorthy  [30]  as  "the  degree  of  difficulty 
in  analysis,  testing,  design,  and  implementation."  However,  our  notion  of  software 
complexity  is  closer  to  the  structure  complexity  metrics  [15]  which  views  the  program 
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as  a  component  of  a  larger  system  and  focus  on  the  interconnections  of  the  system 
components. 

The  software  metrics  are  classified  as  either  design  or  implementation  (code) 
metrics;  design  metrics  measure  the  complexity  of  a  design  whereas  implementation 
(code)  metrics  measure  that  of  an  implementation.  The  metrics  chosen  include  design 
metrics  such  as  the  metrics  of  modularity  [27]  (cohesion  and  coupling)  and  imple- 
mentation (code)  metrics  such  as  the  cyclomatic  complexity  [25]  of  the  program. 
Furthermore,  we  include  a  design-implementation  metrics:  the  stability  of  programs 
[48]. 

In  addition,  we  considered  code  metrics,  which  focus  on  the  individual  system 
components  (procedures  and  modules)  and  require  a  detailed  knowledge  of  their 
internal  mechanisms,  such  as  McCabe's  cyclomatic  complexity  number  [25].  We 
also  consider,  as  indicated  above,  structure  metrics,  including  Henry  and  Kafura's 
Information  Flow  [11]  metrics  and  Yau  and  Collofello's  Logical  Stability  [48]  metric. 
Each  of  these  metrics  is  measured  by  different  characteristics  of  the  program;  that  is 
to  say,  by  evaluating  the  characteristics  we  may  infer  the  degree  of  the  metrics. 

In  the  following  sections  we  explain  several  of  these  metrics  and  the  rationale 
for  discarding  each  metric. 

2.2.1     Modularity  Metrics 

Modularity  metrics  were  initially  defined  by  Myers  [27] .  They  include  two  re- 
lated aspects:  cohesion  (strength  or  binding)  and  the  coupling  of  modules.  Coupling 
is  an  indication  of  the  level  of  interconnections  between  modules.  A  software  compo- 
nent is  said  to  exhibit  a  high  degree  of  cohesion  if  the  elements  in  that  unit  exhibit 
a  high  degree  of  "functional  relatedness"  [39];  that  is  to  say,  they  exhibit  functional 
unity  [8].    The  metrics  of  coupling  among  "modules"  (either  procedures/functions 
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in  a  non-object-oriented  language  and  objects  in  the  object  finder)  are  measured  by 
"structural  fan- in/fan-out"  degrees  [6]  and  by  "informational  fan- in/fan-out"  [11]. 
Since  the  data  flow  count  includes  procedure  calls,  informational  fan-in/fan-out  sub- 
sumes structural  fan-in/fan-out  [39]. 

Information  flow  [11]  concepts  are  used  to  measure  the  module  coupling  between 
modules  in  a  software  system.  These  measures  focus  on  the  interface  which  connects 
system  components.  Myers  [27]  established  six  categories  of  coupling  based  on  the 
data  relationships  among  the  modules.  The  information  flow  metrics  can  recognize 
two  of  these  categories,  including  content  coupling  (refers  to  direct  references  between 
the  modules)  and  common  coupling  (refers  to  the  sharing  of  a  global  data  structure). 
The  information  flow  metrics  is  also  used  to  measure  the  procedure  and  module 
complexity  of  a  software  system. 

A  measure  of  the  "strength  of  the  connections  from  module  A  to  module  B"  [11] 
is: 

{PEI(A)  +  PII{B))  *  IP(A,  B) 

where  PEI(A)  is  the  number  of  procedures  exporting  information  from  module  A, 
PII(B)  is  the  number  of  procedures  importing  information  into  module  B,  and 
IP(A,  B)  is  the  number  of  information  paths  between  the  procedures.  Thus,  the 
resulting  metrics  are  a  matrix  of  the  coupling  between  any  two  modules  in  the  software 
system. 

The  information  needed  to  calculate  this  measure  of  coupling  follows: 

The  information  flow  between  modules  depends  on  the  information  flow  between 

procedures  which  are  part  of  the  module.    A  module  is  defined  with  respect  to  a 

data  structure  D  in  the  program  consisting  of  those  procedures  which  either  directly 

update  D  or  directly  retrieve  information  from  D.  Thus,  examination  of  the  global 
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flow  in  each  module  reveals  the  number  of  procedures  in  each  module  and  all  possible 
interconnections  between  the  module  procedures  and  the  data  structure. 
There  are  several  kinds  of  information  flow  between  modules: 

Definition  1  There  is  a  global  flow  of  inform,ation  from  module  A  to  module  B  through 
a  global  data  structure  D  if  A  deposits  (updates)  information  into  D  and  B  retrieves 
information  from  D. 

Definition  2  There  is  a  local  flow  of  information  from  module  A  to  module  B  if  one 
or  more  of  the  following  conditions  hold: 

1.  if  A  calls  B, 

2.  if  B  calls  A  and  A  returns  a  value  to  B,  which  B  subsequently  utilizes,  or 

3.  if  C  calls  both  A  and  B  passing  an  output  value  from  A  to  B. 

Definition  3  There  is  a  direct  local  flow  of  information  from  module  A  to  module  B 
if  condition  (1)  of  Definition  2  holds  for  a  local  flow. 

Definition  4  There  is  an  indirect  local  flow  of  information  from  module  A  to  module 
B  if  condition  (2)  or  condition  (3)  of  the  Definition  2  holds  for  a  local  flow. 

Some  examples  of  these  information  flows  are  illustrated  in  Figure  2.1.  Figure  2.1 
shows  a  simple  system  consisting  of  six  modules  (A,B,C,D,E,F),  a  data  structure  (D- 
S),  and  the  connections  among  these  components. 

The  possible  skeleton  code  for  this  system  is  shown  in  Figure  2.2.  As  indicated 
in  the  skeleton  code,  module  A  retrieves  information  from  DS  and  then  calls  B  passing 
a  parameter;  module  B  then  updates  DS.  C  calls  D,  passing  a  parameter.  Module  D 
calls  E  with  a  parameter  and  E  returns  a  value  which  D  then  uses  and  passes  to  F. 
The  function  of  F  is  to  update  DS. 
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Figure  2.1.  An  example  of  information  flow 


type_ds  DS; 

void  A() 
{ 

void  D(type  ds  p) 
{ 

type_ds  x; 

type_ds  q; 

x  =  DS  *  1; 
B(x) 

q  =  E(p); 
F(p»q); 

}  '" 

} 

void  B(type  ds 
{ 

x) 

void  E(type  ds  p) 
{ 

DS  =  x; 

return  (p  +  3) ; 

}  '" 

}  '" 

void  C(type  ds 
{ 

P) 

void  F(type  ds  p,  type  ds  q) 
{ 

D(p); 


DS  =  p  +  q; 


Figure  2.2.  Possible  skeleton  code  for  the  example 
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Table  2.1.  Information  flows  relation  generated  for  the  example. 


Direct  local  flows 

A^B 

C  -+D 

D  -+  E 

D  ->  F 

Indirect  local  flows 

E^D 

E^F 

Global  flows 

B  ->  A 

F^A 

The  information  flow  analysis  [11]  process  of  a  program  consists  of  deriving  the 
complete  flow  structure  using  a  procedure-by-procedure  analysis  of  the  program. 

The  information  flows  for  this  example  are  summarized  in  Table  2.1.  The  direct 
local  flows  are  simply  due  to  the  parameter  passing.  The  indirect  local  flows  are  due 
to  "side-effects"  relationships  between  modules.  The  first  indirect  flow,  from  E  to  D, 
results  when  E  returns  a  value  (q)  which  D  "uses"  in  its  computation.  The  second 
indirect  flow,  from  E  to  F,  results  when  other  information  (q),  that  D  receives  from  E, 
is  passed  unchanged  to  F.  Finally,  the  global  flow  is  due  to  information  flow  passing 
through  the  global  data  structure  DS. 

The  next  step  is  to  compute  the  complexity  of  a  module.  The  complexity  of  a 
module  is  the  sum  of  the  complexities  of  the  procedures  within  the  module. 

The  complexity  of  a  procedure  depends  on  two  factors:  the  complexity  of  the 
procedure  code  and  the  complexity  of  the  procedure's  connections  to  its  environment. 
A  very  simple  length  measure  was  used  as  an  index  of  procedure  code  complexity  [11]: 
the  number  of  lines  of  text  in  the  source  code  of  the  procedure  (including  embedded 
comments  but  not  those  preceding  the  procedure  statement).  The  connections  of 
a  procedure  to  its  environment  are  determined  by  the  (informational)  fan-in  and 
fan-out  of  the  procedure  defined  as  follows: 
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Definition  5  The  fan- in  of  procedure  A  is  the  number  of  local  flows  into  procedure  A 
plus  the  number  of  data  structures  from  which  procedure  A  retrieves  information. 

Definition  6  The  fan-out  of  procedure  A  is  the  number  of  local  flows  from  procedure 
A  plus  the  number  of  data  structures  which  procedure  A  updates. 

However,  in  order  to  compute  the  complexity  of  the  procedures  which  are  in- 
cluded in  a  module,  one  should  only  consider  local  flows  for  the  data  structures 
associated  with  the  module. 

Finally,  the  complexity  of  a  procedure  is: 

length  *  (fanin  *  fanout)2 

The  term  (fan  —  in  *  fan  —  out)  represents  the  total  possible  number  of  combinations 
of  an  input  source  to  an  output  destination. 

In  conclusion,  the  complexity  of  a  procedure  contained  in  a  specific  module 
is  computed  using  only  the  local  flows  for  the  data  structure  associated  with  that 
module.  The  complexity  of  a  module  is  computed  as  the  sum  of  the  complexities 
of  the  procedures  within  the  module.  We  use  these  complexity  metrics  concepts  in 
defining  a  new  set  of  complexity  measure  to  evaluate  the  object  finder. 

2.2.2     Zage's  Design  Metrics 

In  the  case  of  a  structured  design,  Zage  [51]  developed  a  design  quality  metric 
D(G)  of  the  form 

D(G)    =    h{De)  +  k2(Di) 

In  this  equation,  kx  and  k2  are  constants  and  De  and  D;  are,  respectively,  an  external 
and  an  internal  design  quality  component.  De  considers  a  module's  external  relation- 
ships to  other  modules  in  the  software  system,  whereas  Dt-  considers  factors  related 
to  the  internal  structure. 
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The  calculation  of  De  and  D;  is  performed  in  two  different  stages  of  software 
design.  De  is  based  on  information  available  during  architectural  design,  whereas  D,- 
is  calculated  after  the  detailed  design  is  completed. 

De  is  calculated  for  each  module  of  a  system  and  is  comprised  of  two  terms: 
one  product  related  to  the  amount  of  data  flowing  through  the  module  and  another 
product  giving  the  number  of  paths  through  the  module. 

De    =     [Weighted  Inflows  *  Weighted  Outflows)  +  (Fan  In  *  Fan  Out) 

De  appears  to  highlight  stress  points  in  architectural  design.  By  redesigning 
these  points,  lower  values  of  De  were  obtained  [51]  which,  in  effect,  says  that  a 
reduction  of  De  means  a  reduction  on  the  coupling  between  modules. 

The  internal  design  metrics  component  D{  is  calculated  as  follows: 

Di    =    Wl(CC)  +  w2(DSM)  +  w3(I/0) 

where 

CC         (Central  Calls)  are  procedure  or  function  invocations 

DSM     (Data  Structure  Manipulations)  are  references  to  complex  data  types 

I/O        (Input/Output)  are  external  device  accesses 

and  w\,  u>25  and  W3  are  weighting  factors. 

The  use  of  these  three  measures  (CC,  DSM,  and  I/O)  is  due  to  the  desire 
to  "choose  a  small  set  of  measures  which  would  be  easy  to  collect  and  which  would 
identify  stress  points  to  capture  the  essence  of  complexity"  [51].  Their  results  indicate 
that  stressing  the  data  structure  manipulation  usage  within  modules  gives  excellent 
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results  as  a  predictor  of  error-prone  modules.  The  proposed  values  of  the  weighting 
factors  are:  Wi  =  1,  w2  =  2.5,  and  w3  =  1.  A  was  also  better  than  cyclomatic 
complexity  v(G)  and  lines  of  code  (LOC)  as  predictor  of  error-prone  modules  [51]. 

The  design  metrics  were  used  as  guidelines  for  the  development  of  our  new  set 
of  complexity  metrics. 

2.2.3     Cyclomatic  Complexity 

The  computation  of  the  metrics  on  cyclomatic  complexity  [25]  of  a  program 
is  based  on  the  number  of  "connected"  components,  the  number  of  edges,  and  the 
number  of  nodes  in  the  program  control  graph.  In  practice,  this  metrics  for  struc- 
tured programs  consist  of  the  number  of  predicates  in  a  program  plus  one  [25] .  The 
conditionals  are  treated  as  contributing  one  for  each  predicate,  thus  we  need  to  add 
two  whenever  there  is  a  logical  "and"  and  N  —  1  whenever  there  is  a  case  statement 
with  N  cases. 

According  to  Shooman  [37],  McCabe  has  validated  this  metrics  of  cyclomatic 
complexity.  McCabe  concluded  that,  from  a  set  of  programs,  those  with  a  complex 
graph  and  a  large  v{G)  are  often  the  trouble-prone  programs. 

The  cyclomatic  complexity  of  a  program  is  derived  by  computing  the  number 
of  conditions  in  each  component  (function).  That  is,  McCabe's  cyclomatic  number 
for  a  collection  of  strongly  connected  graphs  is  determined  from  the  union  of  those 
strongly  connected  graphs.  The  cyclomatic  number  of  this  union  is  computed  to 
determine  the  cyclomatic  number  of  the  complete  program. 

The  measure  of  complexity  provided  by  these  metrics  focuses  on  individual 
system  components.  The  complexity  metrics  which  we  must  consider  should  focus 
on  the  structure  of  the  system  as  well  as  on  the  high-level  design  information  as 
criteria  to  determine  the  complexity  of  a  modularization. 
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2.2.4     Stability  Metrics 

The  metrics  on  program  stability  [48]  are  based  on  the  stability  of  a  module, 
defined  as  a  measure  of  the  resistance  to  the  potential  ripple  effect  from  a  modification 
of  the  module  on  other  modules  in  the  program.  There  are  two  sides  to  these  metrics: 
the  logical  stability  of  a  module,  defined  in  terms  of  logical  considerations,  and  the 
performance  stability  of  a  module,  defined  in  terms  of  performance  considerations. 
In  our  metrics  study,  we  concentrate  on  the  logical  stability  aspect  of  programs. 

The  logical  stability  of  programs  is  defined  in  terms  of  a  primitive  subset  of  the 
maintenance  activity,  such  as  a  change  to  a  single  variable  definition  in  a  module. 
The  incremental  approach  to  computing  the  logical  stability  metrics  [48]  of  a  program 
begins  by  computing  the  sets  of  interface  variables  which  are  affected  by  primitive 
modifications  to  variable  definitions  in  a  module  by  intramodule  change  propagation. 
In  addition,  for  each  interface  variable,  it  is  necessary  to  compute  the  set  of  modules 
which  are  involved  in  intermodule  change  propagation  as  a  consequence  of  affecting 
the  variable.  Then,  for  each  variable  definition  one  must  compute  the  set  of  modules 
which  are  involved  in  intermodule  change  propagation  as  a  consequence  of  modifying 
the  variable  definition.  Next,  the  individual  complexity  associated  with  each  module 
is  defined  using  McCabe's  cyclomatic  complexity.  The  potential  ripple  effect  of  a 
module  is  the  probability  that  a  variable  definition  will  be  chosen  for  modification 
times  the  complexity  associated  with  the  modules  affected  by  such  variable  definition. 
Finally,  the  logical  stability  of  a  program  is  the  inverse  of  this  potential  ripple  effect 
of  a  primitive  modification  to  a  variable  definition  in  a  module. 

We  argue  that  our  measure  of  complexity  is  based  on  similar  grounds  as  the  sta- 
bility metrics.  The  stability  metrics  concentrate  on  the  opposition  to  the  propagation 
of  the  effect  of  primitive  modifications  by  a  software  system.  The  propagation  occurs 
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through  the  data  transfer  in  the  system.    On  the  other  hand,  our  metrics  consider 
the  relationship  between  components  based  on  the  links  established  by  those  data 

transfers. 


CHAPTER  3 
THE  PROPOSED  APPROACH 

Our  approach  aids  software  maintenance  by  assisting  the  understanding  pro- 
cess of  the  design  of  a  software  from  its  source  code.  The  approach's  output  is  a 
modularization  of  a  program  that  consists  of  the  collection  of  "objects"  found  in  the 
program.  This  modularization  of  a  system,  that  usually  has  no  (or  little)  existing 
high  level  documentation,  gives  the  maintainers  an  understanding  of  the  structure  of 
the  system.  Since  the  approach  is  based  on  the  data  and  type  of  the  system,  it  also 
assists  the  maintainers  with  understanding  of  the  system  data. 

The  proposed  approach  consists  of  a  partial  classification  of  the  program  ele- 
ments (routines,  types,  and  data  items)  that  is  meaningful  in  the  context  of  the  target 
program  and  its  real  world  domain.  The  information  required  for  this  classification 
consists  of  the  relationships  between  those  program  elements  in  term  of  data  bindings 
and  type  bindings. 

Two  methods  of  object  identification  are  used.  The  first  is  based  on  global  and 
persistent  data  and  establishes  links  to  the  routines  that  manipulate  such  data.  The 
second  method  is  based  in  the  data  types  and  establishes  relationships  between  such 
types  and  the  routines  that  used  them  for  formal  parameters  and  return  values.  Both 
methods  result  in  sets  of  identified  candidate  objects. 

The  resulting  candidate  objects  from  both  methods  are  not  completely  disjoint 
since  they  represent  both  object  classes  and  instances  of  objects  in  the  more  classical 
sense.  In  addition,  this  allows  the  methods  to  capture  the  intentional  "violations" 
made  by  the  designer/implementor  of  the  underlying  design.  The  candidate  objects 
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represent  the  structure  of  the  program  in  terms  of  groups  of  routines  implicit  from 
the  design  and  the  relationships  between  those  groups. 

3.1     The  Proposed  Approach 

The  proposed  approach  consists  of  identifying  the  object-like  features  in  conven- 
tional programming  languages.  Object-oriented  constructs  are  not  directly  supported 
in  conventional  programming  languages;  however,  several  object-like  features,  such 
as  groupings  of  related  data  and  abstract  data  types,  are  found  in  those  programming 
languages.  The  proposed  approach  identifies  these  groupings  of  data  and  abstract 
data  types  in  terms  of  "objects."  Most  objects  are  collections  of  data,  together  with 
the  methods  needed  to  access  and  manipulate  those  data. 

An  "object,"  in  a  conventional  programming  language,  can  be  identified  as  a 
collection  of  routines,  types,  and/or  data  items.  The  routines  will  implement  the 
methods  associated  with  the  object,  the  types  will  structure  the  data  they  conceal  or 
process,  and  the  data  items  will  represent  or  point  to  actual  instances  of  the  object 
class.  Thus,  we  may  characterize  our  candidate  "objects"  as  tuples  of  three  sets: 

Candidate  Object    -     (F,  T,  D) 

where  F  is  a  set  of  routines,  T  is  a  set  of  types,  and  D  is  a  set  of  data  items.  Any  of 
these  sets  may  be  empty;  ideally  sets  from  distinct  objects  will  not  overlap  so  that  a 
routine,  type,  or  data  item  should  not  appear  in  more  than  one  object. 

A  program  contains  routines,  types,  and  data  items.  The  proposed  approach 
consists  of  a  partial  classification  of  these  elements  that  is  meaningful  in  the  context 
of  the  target  program  and  its  real  world  domain.  A  large  part  of  the  information  for 
this  classification  can  come  from  analyzing  the  relationships  between  the  components 
of  the  program,  but  human  intervention  or  very  carefully  chosen  heuristics  will  be 
needed  to  remove  coincidental  and  meaningless  relationships. 
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The  identified  candidate  objects  are  not  completely  disjoint;  there  is  some  in- 
tentional fuzziness  in  the  definition  of  candidate  objects.  It  is  often  the  case  in  a 
real  program  that  the  original  implementor  (or  a  later  maintainer!)  has  violated  the 
cleanness  of  the  underlying  design  in  a  few  instances,  either  from  laziness  or  to  gain 
efficiency.  It  would  unnecessarily  reduce  the  usefulness  of  object  finding  to  reject  out 
of  hand  any  candidates  that  had  small  overlaps  or  violations  of  good  information 
hiding  practice. 

Furthermore,  the  definition  given  does  not  distinguish  clearly  between  the  con- 
cept of  an  object  class  and  the  concept  of  an  object.  As  a  practical  matter,  in  some 
cases  it  may  be  easier  to  first  find  the  class  and  then  its  instances  and  in  other  cases 
to  reverse  this  procedure.  Thus,  it  is  more  convenient  to  treat  the  two  together. 

Two  broad  methods  of  object  finding  seem  to  be  useful.  The  first  is  based  on 
global  and  persistent  (e.g.  static  in  "C")  data  and  establishes  links  to  the  routines 
that  manipulate  such  data.  The  second  methodology  is  based  on  data  types  and 
establishes  relationships  between  such  types  and  the  routines  that  use  them  for  formal 
parameters  or  return  values.  Without  loss  generality,  we  assume  that  all  identifiers, 
such  as  routines,  variables,  and  types,  are  distinguishable  from  their  names. 

The  first  method  of  object  identification  is  given  in  Algorithm  1. 

Algorithm  1   Globals-based  Object  Finder 

Input  :  A  program  in  a  conventional  programming  language,  such  as  Ada,  COBOL, 
C,  or  FORTRAN,  with  scoping  mechanisms. 

Output  :  A  collection  of  candidate  objects. 

Steps  : 
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1.  For  each  global  variable  x  (i.e.,  a  variable  shared  by  at  least  two  routines), 
let  P(x)  be  the  set  of  routines  which  directly  uses  x. 

2.  Considering  each  P(x)  as  a  node,  construct  a  graph  G  =  (V,  E)  such  that: 

V    =     {P(x)  |  x  is  shared  by  at  least  two  routines} 


E    =    {P{x1)P(x2)\P(x1)r\P(x2)^d>}. 

3.   Construct  a  candidate  object  (F,T,D)  from  each  strongly  connected  com- 
ponent (v,e)  in  G  where 

F  =   ux{P(x)  |  P(x)  e  v} 

T    =    0 

D    =    \Jx{x  |  P(x)  e  v} 

Example  1.  Globals-based  Object  Finder — Single  Stack/Queue: 

To  motivate  this  first  method,  take  as  an  example  a  package  in  an  Ada-like 
language  for  manipulating  a  single  queue  and  a  single  stack  of  data  with  type  Element. 
The  package  provides  the  interface  routines  so  that  the  following  three  routines  access 
a  global  data  STACK: 

procedure  Push_S    (X: Element); 

—  Push  an  element  X  to  the  Stack, 
function  Pop_S   return  Element; 

—  Pop  the  top   element  from  the  Stack, 
function  Is_Empty_S  return  Boolean; 

—  Return  true   if  the   Stack   is   empty. 

and  the  following  three  routines  access  a  global  data  QUEUE: 

procedure  Push_Q  (X: Element); 

—  Push  an  element  X  to  the  Queue, 
function  Pop_Q  return  Element; 

—  Pop  the  front  element  from  the  Queue, 
function  Is_Empty_Q  return  Boolean; 

—  Return  true  if  the  Queue  is  empty. 
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If  there  is  no  other  direct  relation  between  these  two  groups,  then  clearly,  Push_S, 
Pop.S,  and  Is_Empty_S  belong  to  one  candidate  object  and  Push_Q,  Pop_Q,  and 
Is_Empty_Q  belong  to  another.  The  Globals-based  Object  Finder  given  above  would 
easily  identify  these  two  objects  as  the  following  two  tuples: 

(JFifTi,JDi)  =  ({Push_S,Pop_S,  Is_Empty_S},0,  {Stack}) 
(F2,T2,D2)     =     ({Push_Q,  Pop.Q,  Is_Empty_Q},0,  {Queue}) 

However,  this  method  in  many  cases  may  produce  objects  which  are  "too  big,"  since 
any  routine  that  uses  global  data  from  two  objects  creates  a  link  between  them.  Thus, 
a  further  stage  of  refinement  will  likely  be  necessary  in  which  human  intervention  or 
heuristically  guided  search  procedures  improve  the  candidate  objects  by  excluding 
offending  routines  or  data  items  from  the  graph  G. 

The  Globals-based  Object  Finder  could  utilize  information  other  than  the  ac- 
cesses to  global  variables  by  routines;  in  particular,  it  could  also  use  the  information 
about  references  and  definitions  of  local  variables  and  formal  parameters.  Then,  a 
more  detailed  analysis  of  the  internals  of  a  routine  will  be  required  to  obtain  that 
information  such  as  a  semantic  analysis  to  determine  whether  the  references  to  local 
variables  and  formal  parameters  in  a  routine  effectively  access  the  same  data  item 
across  invocations  of  the  routine.  Then,  the  routine  could  be  made  part  of  the  group 
of  routines  that  access  that  data. 

The  kind  of  analysis  needed  to  obtain  this  knowledge  will  be  developed  in  the 
future  study  of  this  dissertation.  One  suggestion  is  to  use  data  flow  analysis  of  the 
internals  of  routines  to  determine  the  "pattern"  of  accesses  of  local  variables  and 
formal  parameters  inside  the  routines.  A  pattern  represents  the  uses  and  definitions 
of  local  variables  and  formal  parameters  in  a  routine  similar  to  knowledge-based 
approaches  in  Section  2.1.4.    These  patterns  could  be  used  as  a  criteria  to  group 
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together  those  routines  that  exhibit  similar  patterns.  For  example,  a  pattern  could 
be  defined  as  the  use  of  a  local  variable  as  an  index  in  an  array  of  elements.  If  this 
pattern  is  identified  in  a  routine  that  has  an  array  representation  of  an  address  table 
and  in  another  routine  with  an  array  representation  of  a  symbol  table,  we  argue  that 
both  routines  should  be  part  of  the  same  table  indexing  group. 

The  second  method  of  object  identification  is  given  in  Algorithm  2. 

Algorithm  2  Types-based  Object  Finder 

Input  :  A  program  in  a  conventional  programming  language,  such  as  Ada,  COBOL, 
or  C  with  data  type  abstraction  mechanisms. 

Output  :  A  collection  of  candidate  objects. 

Steps  : 

1.  (Ordering)  Define  a  topological  order  of  all  types  in  the  program  as  follows: 

(a)  If  type  x  is  used  to  define  type  y,  then  we  say  x  is  a  "part  of"  y  and 
y   "contains"  x,  denoted  by  x  <C  y. 

(b)  x  <C  x  is  true. 

(c)  J/i  «i/  and  y  <€.  x,  then  we  say  x  is   "equivalent"  to  y,   denoted 
x  =  y. 

(d)  If  x  <C  y  and  y  <C  z,  then  x  <C  z. 

2.  (Initial  classification)  Construct  a  relationship  matrix  R(F,T)  in  which 
rows  are  routines  and  columns  are  types  of  formal  parameters  and  return 
values.  Initially,  all  entries  of  R(F,T)  are  zeroes.  An  entry  R(f,t)  is  set 
to  1  if  type  t  is  a  "part  of"  the  type  of  a  formal  parameter  or  of  a  return 
value  of  routine  f. 
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3.  (Classification  readjustment)  For  each  row  f  of  the  matrix  R,  mark  R(f,t) 
as  0  if  there  exists  any  other  type  on  the  same  row  which  "contains"  type 
t  and  has  been  marked  as  1. 

J.  (Grouping)  Collect  the  routines  into  maximal  groups  based  on  sharing  of 
types.  Specifically,  routines  /1  and  fi  are  in  the  same  group  if  there  exists 
a  type  t  such  that  R(fi,t)  =  R(f2,t)  —  1. 

5.   Construct  a  candidate  object  (F,  T,  D)  from  each  group  where 

F    =     { /  I  the  routine  f  is  a  member  of  the  group} 
T    =    {t  J  R(f,  t)  =  1  for  some  f  in  F) 
D    =    0 

Again,  in  many  cases  the  candidate  objects  created  may  be  "too  big."   As  can 
be  seen  in  the  following  example,  the  culprit  will  often  be  a  type  that  a  human  can 
easily  identify  as  irrelevant  to  the  objects  being  identified. 
Example  2.  Types-based  Object  Finder — Multiple  Stacks/Queues: 

In  this  example,  there  are  four  basic  groups  of  routines.  The  first  group  manip- 
ulates complex  numbers.  The  second  group  is  related  to  multiple  instances  of  stacks 
of  complex  numbers.  The  third  group  is  related  to  multiple  instances  of  queues  of 
complex  numbers.  The  fourth  group  involves  routines  that  manipulate  both  stacks 
and  queues.  The  four  groups  are  specified  in  algebraic  form  as  follows: 

—  Group  I : 

Construct  (Real,  Real)  =>  Complex 

—  construct  a  complex  from  two  reals 
"+"  (Complex,  Complex)  =>  Complex  —  plus 
"-"  (Complex,  Complex)  =>  Complex  —  minus 
"*"  (Complex,  Complex)  =>  Complex  —  multiplication 
"/"  (Complex,  Complex)  =>  Complex  —  division 

—  Group  II: 

Pop_S    (Stack)    =>   Stack  x  Complex 
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remove  the  top  element  from  a  stack  and  return  it 
Push.S  (Stack,  Complex)  =>  Stack 

—  push  a  complex  number  onto  a  stack 
Is_Empty_S  (Stack)  =>  Boolean 

—  return  true  if  Stack  is  empty- 
Group  III: 

Pop_Q  (Queue)  =>  Queue  x  Complex 

—  remove  the  head  element  from  a  queue  and  return  it 
Push_Q  (Queue,  Complex)  =>  Queue 

—  push  a  complex  number  onto  a  queue 
Is_Empty_Q  (Queue)  =>  Boolean 

—  return  true  if  Queue  is  empty 

Group  IV: 

Queue_to_Stack   (Queue)    =>  Stack  —  convert   a  queue  to  a  stack 

Stack_to_Queue    (Stack)    =>  Queue  —  convert   a  stack  to  a  queue 


In  this  example,  there  are  no  global  variables  used.   In  applying  Algorithm  2, 
we  will  develop  the  following  matrix  R: 


Group  ID 

Routines 

Complex 

Real 

Stack 

Queue 

Boolean 

I 

Construct 

1 

0 

ii  in 

1 

ii  —  ii 

1 

"*" 

1 

ii  /n 

1 

II 

Push_S 

0 

1 

Pop_S 

0 

1 

Is_Empty_S 

1 

1 

III 

Push.Q 

0 

1 

Pop_Q 

0 

1 

Is_Empty_Q 

1 

1 

IV 

Queue_to_ Stack 

1 

1 

Stack_to_Queue 

1 

1 

Blank  entries  (which  will  be  marked  as  O's  according  to  Algorithm  2)  indicate 
no  direct  relationship  between  the  routine  and  the  type.  "0"  means  a  "part  of" 
relationship  has  been  found  by  examining  the  internal  data  structures  of  the  program. 
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In  this  example,  at  some  point  a  complex  will  be  found  to  contain  real  values  and 
the  stack  and  queue  found  to  contain  complex  values. 

The  Types-based  Object  Finder  will  initially  classify  all  the  types  and  routines 
of  groups  II,  III,  and  IV  in  a  single  large  object  because  of  the  false  links  created  by 
the  Boolean  type  that  link  the  stack  and  the  queue.  Some  heuristics  could  be  used 
to  reduce  such  conflicts  (e.g.,  eliminate  primitive  types,  etc.),  but  it  would  also  be 
a  fairly  easy  task  for  a  user  to  intervene  at  this  point  and  identify  Complex,  Stack, 
and  Queue  as  the  objects  of  interest,  provided  that  the  data  can  be  presented  to  him 
clearly. 

There  seems,  however,  to  be  no  easy  way  to  categorize  group  IV,  which  involves 
routines  that  operate  on  both  Queue  and  Stack  objects.  A  solution  given  by  object- 
oriented  design  consists  of  a  guideline  which  requires  that  a  routine  to  be  member  of 
a  class  it  must  either  access  or  modify  data  defined  within  the  class  [18].  Clearly,  this 
indicates  that  a  routine  would  be  classified  according  to  the  type  of  its  input  formal 
parameters;  then,  routine  Queue_to_Stack  belongs  to  group  II  (the  Queue  candi- 
date object)  and  routine  Stack_to_Queue  belongs  to  group  III  (the  Stack  candidate 
object). 

Excluding  group  IV  and  type  Boolean,  the  objects  identified  by  Algorithm  2 
would  be  listed  as  follows: 

(FuTuDi)     =     ({Construct,  "+■',  "-",  "*",  "/"},  {Complex},  0) 
(F2,T2,D2)     =     ({Push_S,  Pop_S,  Is_Empty_S},  {Stack}, 0) 
(F3iTa,Da)     =    ({Push_Q,Pop_Q,  Is_Empty_Q},  {Queue},  0) 

The  previously  described  topological  ordering  of  types,  in  Step  1  of  Algorithm  2, 
is  appropriate  whenever  all  types  in  a  program  are  related  in  terms  of  the  "part  of" 
relationship  defined  above.    This  ordering  represents  the  fact  that  a  given  type  is 
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"part  of"  another  type;  thus,  the  latter  type  is  "more  important"  than  the  former 
type  when  classifying  routines  into  candidate  objects.  That  is  to  say,  a  routine  with 
formal  parameters  or  return  values  with  multiple  types  should  be  classified  as  part  of 
the  group  of  routines  that  manipulate  data  with  the  most  important  of  those  types, 
i.e.,  the  type  that  was  defined  using  the  other  types  of  the  data  manipulated  by  the 
routine. 

The  main  problem  with  this  type  ordering  scheme  occurs  when  some  types 
are  not  related  by  the  "part  of"  relationship.  In  this  case,  the  previous  topological 
ordering  of  types  does  not  completely  characterize  the  relative  importance  of  all  the 
types;  thus,  the  classification  of  routines  using  this  type  ordering  scheme  results  in 
routines  which  may  be  classified  under  more  than  one  type,  such  as  the  routines  in 
group  IV  of  Example  2.  This  problem  can  be  handled  by  an  alternative  type  ordering 
scheme  based  on  the  "complexity"  of  the  data  types  in  a  system. 

The  relative  complexity  type  ordering  consists  of  ordering  all  the  types  in  the 
program  based  on  the  "complexity"  of  the  types.  This  complexity  is  expressed  by  a 
complexity  index  function,  called  CI,  for  a  given  type.  Assume  that  the  types  in  a 
system  are  represented  using  a  tree  that  captures  the  "part  of"  relationship  between 
types.  Then,  if  type  y  is  a  "part  of"  type  x,  type  x  is  an  ancestor  of  type  y  and  type 
y  is  a  descendent  of  type  x  in  the  tree.  An  example  of  trees  representing  structure 
types  is  given  in  Figure  3.1  of  Example  3.  The  complexity  index  of  type  t,  denoted 
CI(t,0),  is  computed  using  the  complexity  index  function  in  Table  3.1. 

Given  that  the  type  of  interest  is  the  root  of  a  tree  representation  of  the  type,  the 
complexity  index  function  recursively  computes  the  complexity  of  the  type  as  the  sum 
of  the  path  lengths  (in  number  of  arcs)  from  the  root  type  to  all  its  descendent  types  in 
the  tree.  In  Table  3.1,  the  added  complexity  by  primitive  types  is  simply  the  length  of 
the  path,  denoted  d,  from  the  root  type  to  the  primitive  type.  The  complexity  due  to 
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Table  3.1.  Complexity  index  function  for  types  in  the  "C"  programming  language. 


type* 

CI{t,d) 

primitive 

d 

pointer  to  primitive  type  y 
or  user-defined  type  y 

d  +  (drf*CI(y,d+l)) 

array  of  primitive  type  y 
or  user-defined  type  y 

d  +  (dimension(t)  *  CI(y,  d  +  1)) 

struct  t  {  fl,   f2,    . . . ,   f n} 

and  f  1,  f  2,  . . .,  fn  are  all  base  field  types 

d  +  J2UCI(fj,d+l) 

struct  t   {  fl,    . . . ,   fk,    ...,   fn} 

and  f  1,  . . .,  fk  are  base  field  types 

and  f k+1,  . . .,  fn  are  recursive  field  types 

d+S+f*S 
where 

f  =  (n  -  k)/n 

pointer  types  is  modified  by  the  dereferencing  factor,  called  drf,  which  represents  the 
pointer's  complexity  added  to  the  complexity  of  the  type;  currently,  the  dereferencing 
factor  may  fluctuate  between  1.5  and  2.0  depending  on  the  system's  use  of  pointer 
types.  The  complexity  due  to  an  array  type,  t,  is  modified  by  the  number  of  elements 
in  the  array,  i.e.,  its  dimension(t).  The  complexity  due  to  a  structure  type  consists 
of  the  sum  of  the  complexity  of  its  base  fields  types.  The  field  types  of  a  structure 
are  of  two  categories:  recursive  field  types  are  those  which  consists  of  either  a  pointer 
to  the  containing  structure  or  essentially  the  same  structure  type  as  the  containing 
structure  (in  "C",  a  typedef  type);  base  field  types  are  those  which  are  not  recursive 
types.  In  the  presence  of  recursive  field  types,  the  complexity  due  to  an  structure 
type  consists  of  the  sum  of  the  complexities  of  the  base  field  types,  denoted  by  S, 
plus  a  fraction  of  this  complexity  caused  by  the  recursive  field  types. 

Two  types  could  be  easily  ordered  by  comparing  their  complexity  indexes  where 
a  type  A  is  more  complex  than  type  B  iff  CI(B)  <  CI  (A).  Hence,  type  A  is  "more 
important"  than  type  B  in  the  topological  ordering  of  types.   A  partial  ordering  of 
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D  E  N  B. 

D  'E 

Figure  3.1.  Tree  representation  of  types  for  complexity  index  function 

the  types  in  a  program  is  established  by  comparing  the  complexity  indexes  of  all  the 

types.  Example  3  illustrates  the  computation  of  the  complexity  indexes. 

Example  3.  Type  ordering  based  on  the  type  complexity  index  function: 

Consider  two  "C"  data  structures  below  where  an  identifier  in  capital  letters 

denotes  a  type  in  the  language  and  one  in  lowercase  letters  denotes  a  data  structure 

field  name.  Assume  that  types  A,  B,  K,  and  L  are  structure  types  and  types  C,  D,  E, 

M,  and  N  are  primitive  types  in  the  "C"  programming  language  as  follows: 

struct  A  {  struct  K  { 

C  c;  struct  L  { 

struct  B  {  struct  B  { 

D  d;  Dd; 

E  e;  E  e; 

};  }; 

>;  N  n; 

}; 

M  m; 

}; 

The  tree  representations  of  those  structures  types,  which  captures  the  "part  of" 
relationship  among  types,  are  shown  in  Figure  3.1.  Then, 

•  The  complexity  indexes  of  types  A  and  B  in  structure  A,  according  to  the  com- 
plexity index  function  in  Table  3.1,  are: 


CI(A,0)    =    0  +  CI(C,l)  +  CI(B,l) 
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=  0  +  l  +  (l  +  C/(D,  2)  +  CI{E,2)) 

=  0  +  1 +  (1  +  2  +  2)) 

=  6 

CI(B,0)    =  0  +  CI{D,l)  +  CI(E,l) 

=  0+1+1 

=  2 


Type  A  is  more  complex  than  type  B,  thus  type  A  is  higher  in  the  ordering  of 
types  than  type  B. 


The  complexity  index  of  type  K  is 


CI(K,0)    =  0  +  CI(M,l)  +  CI(L,l) 

=  0  +  l  +  (l  +  CI(N,2)  +  CI{B,2)) 

=  0  +  l  +  (l+2  +  (2  +  C7(L>,3)  +  C7(£,3))) 

-  0  +  1 +  (1  +  2  + (2  +  3  +  3)) 

=  12 


Then,  type  K  is  more  complex  than  type  A.  This  type  ordering  scheme  allows 
the  comparison  of  two  unrelated  types  to  determine  the  most  complex,  and 
most  important,  of  the  types. 

•  The  complexity  indexes  of  type  B  by  itself  and  within  structures  A  and  L,  are 
equal  since  the  complexity  index  depends  on  the  structure  that  the  type  repre- 
sents which  is  the  same  in  all  three  cases. 

An  additional  problem,  which  is  not  addressed  by  either  of  the  two  type  ordering 
schemes,  occurs  when  fields  of  an  structure  are  conceptually  "more  important"  than 
the  structure,  thus,  the  routines  could  be  classified  into  objects  according  to  the 
former.  For  example,  consider  another  routine  Add_Top_to_Operand  in  the  multiple 
stacks/queues  (Example  2)  which  implements  the  following  functionality: 
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Complex  Add_Top_to_Operand   (Stack:    st,   Complex:    op) 

—  add  the  top  element  from  a  stack  to  another  complex  and  return  result 
Complex:    result,    top; 

top  =  Pop_S(st) ; 
result   =   "+"(top,op); 
return(result) ; 


Using  either  of  the  two  type  ordering  schemes,  routine  Add_Top_to_Operand 
would  be  classified  under  the  Stack  object.  However,  the  functionality  of  the  routine 
indicates  that  it  should  be  classified  under  the  Complex  object  since  the  routine 
implements  operations  on  Complex  entities  of  the  program.  The  functionality  of 
routines  could  be  captured  by  a  data  flow  analysis  of  the  internals  of  a  routine  and 
by  examination  of  the  types  of  variables  accessed  inside  the  routines  (including  global 
variables,  local  variables,  and  formal  parameters). 

3.2     Applicability  of  the  Algorithms 

The  applicability  of  these  algorithms  for  object  identification  includes  the  kind 
of  conventional,  procedural  programming  language  such  as  "Ada,"  "C,"  "COBOL," 
or  "Pascal." 

The  Globals-based  Object  Finder  handles  those  conventional  programming  lan- 
guages as  well  as  "FORTRAN."  Procedural  programming  languages  provide  static  and 
dynamic  scoping  mechanisms  which  allow  the  definition  of  global  variables  as  well  as 
local  variables  with  respect  to  a  particular  scope  level.  The  execution  of  the  body 
of  a  function  in  the  case  of  these  programming  languages  results  in  side  effects  being 
observed  on  the  values  of  the  global  variables  [24] . 

The  Types-based  Object  Finder  handles  procedural  programming  languages 
that  provide  a  data  abstraction  mechanism  which  permits  the  construction  of  com- 
posite types  using  more  primitive  types.  Clearly,  typing  mechanisms  are  also  required 
for  this  algorithm.    "FORTRAN"  is  one  of  these  programming  languages  which  does 


38 


not  provide  explicit  type  construction  mechanisms,  with  the  exception  of  arrays. 
The  limitation  of  applicative  languages,  such  as  "LlSP",  is  that  they  are  not  explic- 
itly typed;  in  addition,  their  (abstract)  data  type  construction  mechanisms  (e.g.,  list 
structure)  are  not  currently  handled  by  our  analysis  approach. 

3.3     Conditions  for  Best  Results 

The  proposed  modularization  approaches  are  particularly  useful  when  the  fol- 
lowing conditions  hold: 

1.  The  program  being  maintained  is  written  in  a  programming  language  that 
supports  object-like  features  such  as  grouping  of  related  data  and  abstract 
data  types.  Implicit  abstract  data  types  are  identified  by  the  approach  of 
object  identification.  In  the  case  of  programming  languages  that  explicitly 
support  the  syntactic  specification  of  abstract  data  types,  this  modularization 
is  used  to  define  the  relationships  between  the  abstract  data  types  defined  in 
the  system. 

2.  The  Globals-based  Object  Finder  requires  that  the  program  being  maintained 
supports  a  static  scoping  mechanism.  This  allows  the  definition  of  global  vari- 
ables and  the  occurrence  of  side  effects  by  the  invocation  of  routines  referencing 
the  global  variables. 

3.  The  Types-based  Object  Finder  requires  that  the  program  being  maintained 
supports  a  data  type  abstraction  mechanism  that  permits  the  construction  of 
composite  types  using  more  primitive  types. 


CHAPTER  4 
TIME  AND  SPACE  COMPLEXITY  ANALYSIS 

In  this  chapter,  the  time  and  space  complexities  of  the  proposed  approach  are 
discussed.  These  complexities  are  independently  analyzed  for  the  two  approaches  of 
object  identification. 

4.1     Algorithm  1.  Globals-based  Object  Finder  Complexity 

The  time  and  space  complexity  of  this  algorithm  follows: 

Step  1.  Build  set  P(x)  for  each  global  variable  x:  Let  N  be  the  number  of  rou- 
tines and  g  be  the  number  of  global  variables  in  a  system.  Assume  that  the  input 
to  step  1  is  a  symbol  table  representation  of  a  program.  This  symbol  table  consists 
of  a  sorted  list  of  entries  each  of  which  contains  an  identifier's  definition  and  refer- 
ences information.  The  time  required  to  look  up  the  references  information  about  an 
identifier  from  the  symbol  table  of  a  program  is  linearly  proportional  to  the  number 
of  identifiers  in  the  program;  the  time  required  is  at  most  0(g  +  N).  Thus,  for  each 
global  variable  x,  the  time  required  to  build  the  unordered  set  P(x)  of  routines  which 
directly  access  x  is  0(g  +  N).  For  g  global  variables,  the  total  time  complexity  of 
this  step  is  0(g(g  +  N)).  For  real  programs,  TV  is  usually  larger  than  g,  and  the  time 
complexity  is  O(gN). 

The  space  requirement  for  this  step  consists  of  the  space  to  store  the  sets  P(x); 
the  maximum  size  of  a  set  P(x)  is  O(N).  For  g  global  variables,  the  space  require- 
ments is  O(gN). 
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Step  2.  Construct  graph  G:  Graph  G  construction  consists  of  making  each  set 
P(x)  to  be  a  node  in  the  graph;  the  edges  of  the  graph  consists  of  the  set  intersection 
between  two  nodes  (only  its  magnitude  is  important). 

The  data  structure  to  store  graph  G  consists  of  lists  which  represent  the  edges 
of  the  graph  as  follows:  an  edge  (gM  gJ)2  P{g-h)  H  P(gJ)2))  where  gJ)X  and  gJ)2 
are  global  variables  and  P{gJb\)  H  P(gJ)2)  is  the  set  of  common  routines  which  is 
also  represented  as  a  list.  The  time  required  to  construct  one  edge  is  the  time  re- 
quired to  obtain  the  intersection  of  unordered  sets  P(g.bi)  and  P(gJ)2).  The  current 
implementation  does  not  presort  sets  P(x)  before  an  intersection  operation.  Thus, 
the  time  is  proportional  to  \P(g-bi)\\P(g-b2)\  and  the  maximum  time  is  0(N2).  An 
improvement  consists  of  presorting  the  sets  P(x).  This  improvement  is  possible  since 
the  kind  of  maintenance  tasks  (design  recovery)  which  we  consider  do  not  involve 
changes  to  the  connectivity  of  a  program;  then,  the  set  P(x),  for  global  variable  x, 
will  remain  unchanged  after  a  modification.  In  this  case,  the  time  required  to  perform 
an  intersection  would  be  O(N). 

The  total  time  required  to  construct  all  the  edges  is  proportional  to  the  number 
of  global  variables,  g,  in  the  program.  The  maximum  number  of  edges  is  0(g2). 
Hence,  the  total  time  complexity  is  0(g2N2). 

Step  3.  Construct  candidate  objects  from  strongly  connected  components:  These 
candidate  objects  are  obtained  using  a  depth-first  search  algorithm  [10,  42]  for  deter- 
mining the  connected  components  of  the  graph  G;  the  complexity  of  such  an  algorithm 
is  O(M),  where  M  is  the  number  of  edges  in  the  graph  which  in  turn  is  bound  by 
the  number  of  nodes  in  the  graph  as  0(g2).  This  algorithm  starts  with  some  node  of 
graph  G.  Then,  we  visit  all  the  connected  nodes  in  the  order  of  a  depth-first  search, 
i.e.,  we  walk,  first,  as  far  as  possible  into  the  graph  without  forming  a  cycle,  and  then 
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we  return  to  the  last  bifurcation  which  had  been  ignored,  and  so  on  until  we  return 
to  the  node  from  which  we  started.  We  restart  the  procedure  just  described  from 
some  new  node  until  all  nodes  have  been  visited. 

Based  on  this  analysis,  the  time  complexity  of  the  Globals-based  Object  Finder 
is  0(gN  +  g2N2  +  g2).  The  bounding  time  complexity  is  0(g2N2). 

The  space  complexity  of  this  algorithm  is  also  proportional  to  the  space  required 
to  save  graph  G.  The  space  required  to  save  all  the  nodes  in  the  graph  is  clearly 
proportional  to  the  number  of  nodes  (i.e.,  number  of  global  variables)  and  the  number 
of  routines  in  the  set  P(x)  associated  with  a  node.  Since  the  maximum  number  of 
routines  in  a  node  is  N,  this  space  complexity  is  O(gN).  The  space  required  to  save 
all  the  edges  in  the  graph  is  bound  by  the  number  of  nodes  and  the  intersection 
set  corresponding  to  an  edge,  i.e.,  0(g2N2).  Then,  the  total  space  complexity  is 
0(gN  +  g2N2). 

4.2     Algorithm  2.  Types-based  Object  Finder  Complexity 

The  time  and  space  complexity  of  this  algorithm  only  considers  the  type  order- 
ing scheme  based  on  the  "part  of"  relationship  between  types;  it  follows: 

Step  1.  Type  ordering:  The  definition  of  a  topological  ordering  is  required  for 
all  the  (abstract  data)  types  used  as  types  of  formal  parameters  or  return  values  of  the 
routines  in  a  program.  The  topological  ordering  of  types  defines  an  ordering  of  all  the 
types  in  the  program  according  to  the  "part  of"  relationship  between  any  two  types; 
i.e.,  if  type  ti  is  used  to  define  type  t2,  then  we  say  that  t\  is  a  "part  of"  t2.  Assume 
the  number  of  types  used  in  a  program,  T,  is  proportional  to  the  size  of  the  program 
in  lines  of  code  L1.  One  algorithm  for  topological  sort  [38]  has  time  complexity  of 
0(n2)  where  n  is  the  number  of  vertices  in  the  graph.   In  our  analysis,  the  number 


1For  real  programs,  T  will  be  usually  smaller  than  L. 
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of  vertices  is  T.  Then,  the  time  complexity,  using  this  topological  sort,  is  bound 
by  0(T2).  Another  algorithm  for  topological  sort  [17]  has  a  total  time  complexity 
bounded  by  (32m  +  24n  +  7b  +  2c  +  16a)  where  m  is  the  number  of  input  relations 
between  types,  n  is  the  number  of  objects,  a  is  the  number  of  objects  without  no 
predecessors  (primitive  types),  b  is  the  number  of  tape  records  in  input,  and  c  is  the 
number  of  tape  records  in  output. 

The  topological  ordering  of  types  is  stored  as  a  tree  which  represents  the  "part 
of"  relationship  between  types  in  a  program.  The  "part  of"  relationship,  between 
two  types  x  and  y,  is  stored  as  lists  (x  y)  where  type  y  is  "part  of"  type  x  and 
type  x  "contains"  type  y.  Given  a  list  (t%  t2  ...  tn),  then  type  t\  "contains"  types 
t2  through  tn.  The  maximum  space  required  to  save  this  tree  is  0(T2).  This  tree  of 
types  usually  has  a  maximum  of  three  or  four  levels  in  its  branches. 

Step  2.  Initial  classification:  The  time  required  to  construct  matrix  R  with  N 
routines  and  T  (abstract  data)  types  is  0(TN).  For  real  programs,  T  is  usually 
smaller  than  N.  Thus,  the  time  required  to  construct  matrix  R  is  bound  by  0(N). 

Step  3.  Classification  readjustment:  The  time  required  for  the  classification 
readjustment  is  proportional  to  the  number  of  routines,  N,  and  the  number  of  types, 
T,  in  matrix  R.  For  a  given  routine  in  row  r  which  has  been  marked  with  1  in  type  t, 
the  time  required  to  determine  whether  there  exists  any  other  type  s  on  the  same  row 
r  which  "contains"  type  t  and  has  also  been  marked  with  1,  is  proportional  to  the 
number  of  types  T  and  to  the  time  required  to  determine  whether  type  s  "contains" 
type  t.  The  latter  time  is  proportional  to  the  number  of  types  in  the  type  ordering 
tree,  and  it  is  equivalent  to  the  time  to  search  for  a  path  from  type  t  to  type  s  in  the 
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subtree  of  the  type  ordering  tree  with  root  equal  to  s.  The  algorithm  used  to  search 
for  this  path  is  a  depth-first  search  [47]  which  has  constant  time  complexity  [38]. 

Hence,  for  a  routine,  the  time  required  for  the  classification  readjustment  is 
0(T).  Given  that  we  have  N  routines,  the  time  required  for  the  classification  read- 
justment of  matrix  R  is  O(NT). 

Step  4.  Grouping:  The  candidate  objects  are  formed  by  collecting  all  routines, 
r,  that  share  a  type  t;  this  is  to  say,  for  a  given  type  in  column  t,  a  candidate  object  of 
routines,  sharing  type  t,  consists  of  all  the  routines,  in  row  r,  with  (column)  type  t  set 
to  1  after  step  3.  The  time  required  to  form  these  candidate  objects  is  proportional  to 
the  number  of  (abstract  data)  types  T  and  the  number  of  routines  iV  in  the  system. 
The  time  complexity  of  this  step  is  0(TN). 

Based  on  this  analysis,  the  time  complexity  of  the  Types-based  Object  Finder 
is  0(T2  +  N  +  NT  +  NT).  As  indicated  above,  for  real  programs  T  is  usually  small 
with  respect  to  iV,  and  the  bounding  time  complexity  is  0(max(T2,NT)). 

The  space  complexity  of  this  algorithm  is  clearly  the  space  required  to  store 
matrix  R  plus  the  space  required  to  store  the  tree  representing  the  types  ordering, 
i.e.,  0(NT  +  T2). 


CHAPTER  5 
EVALUATION  OF  THE  APPROACH 

This  chapter  presents  some  guidelines  for  an  evaluation  of  the  object  identifi- 
cation methods.  The  goals  of  this  evaluation  are  to  compare  the  algorithms  used  for 
identifying  objects  with  other  existing  modularization  techniques  in  term  of  the  com- 
plexity of  the  resulting  modularization  and  to  compare  the  resulting  modularization 
in  a  conventional,  procedural  programming  language  with  the  "classes"  found  in  an 
object-oriented  version  of  the  program. 

The  evaluation  of  the  object  finder  algorithms  consists  of  careful  examination 
of  the  results  of  this  approach.  Two  studies  were  used  to  evaluate  the  identified 
objects.  In  the  first  study,  named  study  I,  we  compared  the  groups  (based  on  the 
identified  candidate  objects)  identified  by  the  object  finder  with  those  groups  (based 
on  the  clusters)  identified  with  hierarchical  clustering  [14].  The  comparison  was 
based  on  the  complexity  of  these  two  partitionings  resulting  from  each  analysis.  The 
results  of  this  study  are  presented  in  this  chapter.  In  the  second  study,  named 
study  II,  we  compared  the  identified  objects  found  in  a  program  with  the  object- 
oriented  programming  classes  found  in  the  object-oriented  version  of  the  program. 
We  explained  the  results  of  study  II  in  Section  8.2. 

A  system's  partitioning  results  from  the  system  modularization,  i.e.,  the  group- 
ing of  routines  into  disjoint  groups.  The  object  finder  and  the  hierarchical  clustering 
technique  [14]  are  different  methods  to  obtain  these  partitionings. 

Metrics  of  the  complexity  of  software  structures  have  been  used  as  valuable 
management  aids,  important  design  tools,  and  as  basic  elements  in  research  efforts 
to  establish  a  quantitative  foundation  for  comparing  language  constructs  and  design 
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methodologies  [11].  In  addition,  module  and  interface  metrics  have  been  used  in 
evaluating  the  modularization  and  the  level  of  coupling  of  a  system  [14].  In  study  I, 
we  continue  to  use  metrics  to  evaluate  the  complexity  of  a  system  partitioning  in  terms 
of  the  complexity  of  the  interfaces  and  the  complexity  of  its  components.  These  two 
views  of  complexity  parallel  coupling  and  cohesion.  Conte  et  al.  [7],  measure  coupling 
as  the  number  of  interconnections  among  modules  and  cohesion  as  the  measure  of 
the  relationships  of  the  elements  within  a  module.  A  design  with  high  degree  of 
coupling  and  low  module  cohesion  will  contain  more  errors  than  a  design  with  low 
module  coupling  and  high  module  cohesion.  Several  primitive  complexity  metrics  can 
be  used  to  quantify  the  coupling  and  cohesion  of  a  module.  Then,  several  factors  are 
used  to  quantify  our  complexity  metrics. 

5.1     Goals  of  the  Evaluation  Studies 

The  primary  goal  of  the  evaluation  studies  was  to  demonstrate  that  the  algo- 
rithms used  for  identifying  objects  result  in  system  partitionings  less  complex  than 
other  existing  modularization  methods.  This  chapter  also  includes  the  development 
of  a  new  set  of  primitive  factors  that  measured  the  complexity  of  a  modularization 
of  a  system  by  characterizing  the  complexity  of  the  corresponding  partitioning. 

The  evaluation  consisted  of  comparing  the  complexity  of  the  candidate  objects 
identified  by  the  object  finder  and  the  complexity  of  the  clusters  defined  using  Basil- 
i's  hierarchical  clustering  technique  [14].  Another  motivation  for  comparing  the  two 
approaches  was  to  determine  whether  the  object  finder  approach  results  effectively 
capture  the  structure  of  a  software  system.  Since  Basili's  hierarchical  clusters  repre- 
sent an  experimentally  validated  initial  approximation  of  the  groups  intended  by  the 
designer  of  the  software  system  [14];  then,  the  results  of  the  object  finder  should  be 
less,  or  at  least  equally,  complex  than  the  hierarchical  clustering  approach  results. 
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The  chosen  criteria  to  make  the  comparison  was  the  complexity  of  each  parti- 
tioning measured  by  complexity  metrics  similar  to  coupling  and  cohesion.  According 
to  Hutchens  and  Basili  [14],  the  question  of  strength  and  coupling  between  elements 
of  a  given  partitioning  is  largely  still  unresolved.  Thus,  a  new  set  of  primitive  factors 
was  developed  to  measure  the  complexity  of  the  relationships  between  modules  of 
a  software  system  as  well  as  the  complexity  of  the  relationships  inside  the  modules. 
Furthermore,  a  measure  of  module  strength  in  terms  of  the  uniqueness  of  the  types 
manipulated  by  the  module  was  developed.  That  is  to  say,  the  new  set  of  factors 
when  applied  to  the  partitioning  of  a  system,  measures  the  complexity  of  the  interface 
between  the  system  modules,  the  complexity  inside  the  modules,  and  the  strength  of 
the  components  inside  a  module. 

Other  metrics  [27,  25,  6,  11,  51,  48,  29]  were  considered  for  computing  the 
complexity  of  the  partitioning.  They  were  not  suitable  for  this  purpose  since  the 
characteristics  to  be  measured  corresponds  to  the  complexity  of  the  interface  (simi- 
lar to  coupling)  and  to  the  complexity  of  the  relationships  between  elements  inside 
a  module  (similar  to  cohesion).  The  object  finder  approach  assumes  that  we  iden- 
tify candidate  objects  in  programs  which  were  originally  designed  without  explicit 
object-oriented  syntactic  features  even  though  we  assumed  that  object-like  features 
are  present  in  the  programs.  Thus,  metrics  based  on  object-oriented  syntactic  fea- 
tures could  not  be  used  in  the  evaluation.  Hence,  we  created  a  new  set  of  primitive 
complexity  metrics  factors  which  measured  three  aspects  of  complexity:  factors  that 
measured  the  complexity  of  the  interfaces  between  modules  in  a  partitioning;  fac- 
tors that  measured  the  complexity  of  the  relationships  between  components  inside 
modules,  in  terms  of  interactions  between  components  in  the  modules;  and  primitive 
metrics  factors  that  measured  the  strength  of  the  relationships  between  components 
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in  a  module  in  terms  of  the  similarity  of  data  types  manipulated  by  the  components 
of  the  module. 

5.2     Methodology  of  the  Evaluation  Studies 

The  methodology  of  the  evaluation  of  the  object  finder  included  two  studies:  (1) 
study  I  consisted  of  comparing  the  groups  identified  by  the  object  finder  (candidate 
objects)  with  those  groups  identified  by  hierarchical  clustering  [14]  (clusters),  and  (2) 
study  II  compared  the  identified  candidate  objects  in  a  program  with  those  classes 
found  in  the  object-oriented  version  of  the  same  program. 

Several  other  evaluation  studies  are  proposed  for  the  future  research  including: 

•  Use  of  experts  to  compare  the  identified  candidate  objects  in  a  system  with  the 
expert  knowledge  about  the  system. 

•  Use  student  academic  projects  as  well  as  industrial- size  programs  to  further 
evaluate  the  object  finder  algorithms. 

The  steps  in  study  I  were: 

1.  Identify  example  programs  for  the  study.  Three  sample  programs  were  identified 
for  this  study  from  the  literature.  Another  industrial- size  software  system  was 
considered  for  future  evaluation  studies. 

2.  Compute  the  identified  candidate  objects  using  the  Top-down  Analysis  (Method 
1  in  Chapter  6)  for  a  sample  program.  The  result  of  this  method  is  a  partition- 
ing of  the  system  that  consists  of  groups  which  corresponds  to  the  identified 
candidate  objects. 

3.  Compute  the  clusters  in  the  sample  program  using  Basili's  hierarchical  cluster- 
ing technique  [14].  The  result  of  this  technique  is  a  partitioning  of  the  system 
which  normally  corresponds  to  the  top-level  clusters  of  the  system. 
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4.  Compute  the  primitive  complexity  metrics  factors  for  each  partitioning  above. 
In  this  step,  we  determined  the  primitive  metrics  factors  for  a  partitioning  of 
the  system  according  to  the  definitions  in  Section  5.3. 

5.  Compare  the  complexity  of  the  two  partitionings  using  the  results  of  the  prim- 
itive complexity  metrics  factors  for  each  partitioning. 

The  steps  in  study  II  are: 

1.  Identify  example  programs  for  the  study.  One  program  was  chosen  for  this 
study  from  the  literature. 

2.  Conversion  of  object-oriented  code  into  non  object-oriented  code.  For  the  com- 
parison of  identified  objects  and  classes  of  an  object-oriented  version  of  a  pro- 
gram, we  require  a  program  with  two  versions  including  an  object-oriented 
version  and  an  equivalent  non  object-oriented  version.  We  choose  an  object- 
oriented  program  from  the  literature.  Then,  we  derived  a  functionally  equiva- 
lent version  of  this  program  using  non  object-oriented  techniques. 

3.  Compare  the  identified  objects  with  the  classes. 

The  instrument  and  results  of  study  II  are  reported  in  Section  8.2. 

5.3     Primitive  Metrics  of  Complexity 

Assume  that  a  system  contains  a  large  number  of  routines  and  data  structures. 
Our  complexity  metrics  factors  measure  the  ability  to  partition  [3]  the  system  as  to 
absorb  as  many  relations  between  routines  in  a  group  as  possible  and  thus  leave  few 
inter-group  connections  which  results  in  less  complex  group  interfaces.  In  addition, 
the  complexity  metrics  measure  the  ability  of  reducing  the  number  of  relations  be- 
tween routines  inside  a  group  which  results  in  less  complex  relations  inside  groups. 
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That  is  to  say,  we  categorize  what  makes  one  partition  "better"  than  another  in  spite 
of  the  fact  that  the  connectivity  of  routines  is  always  the  same  and  only  their  group 
assignment  changes  from  partitioning  to  partitioning.  Beladi  and  Evangelisti  define 
the  degree  of  connectivity  of  a  cluster  as  the  number  of  "connections"  between  the 
elements  of  the  cluster.  Program  routines  and  data  structures  are  interconnected  by 
routines  invocations  and  references  in  software  systems  [3]. 

In  the  absence  of  a  direct  measure  of  the  inter-group  complexity  and  intra-group 
complexity  which  given  a  partitioning  would  compute  these  metrics,  we  developed 
the  set  of  primitive  complexity  metrics  factors  which,  we  argue,  measure  inter-group 
and  intra-group  complexities,  as  well  as  intra-group  strength,  as  demonstrated  in  Sec- 
tion 5.3.5.  The  union  of  all  primitive  metrics  factors  related  to  one  kind  of  complexity 
(inter-group  complexity,  intra-group  complexity,  or  intra-group  strength)  quantifies 
the  corresponding  complexity. 

The  following  sections  present  important  definitions  and  examples,  the  primitive 
factors  of  complexity,  and  a  validation  of  these  factors. 

5.3.1     Definitions 

Assume  that  all  variables  in  a  program  have  unique  names.  A  use  of  a  vari- 
able refers  to  either  a  definition  (a  value  is  assigned  to  a  variable)  or  a  reference  (a 
variable's  value  is  used)  of  the  variable  [12]. 

Definition  7  A  global  variable  is  a  variable  directly  used  within  at  least  two  modules. 

Definition  8  Let  P  be  a  module.  There  is  a  module- global  access  pair  (P,g)  if  g  is  a 
global  variable  used  within  P. 

Synopsis:  A  module-global  access  pair  (M,g)  represents  the  fact  that  global 
variable  g  is  used  within  module  M . 
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Example  1  Module-global  access  pair  (M,g) 

M() 
{ 
//  g  is  used  within  M 

} 


Definition  9  Let  (Q,g)  be  a  module-global  access  pair.  There  is  a  module- global 
indirect  access  pair  (P,g)  if  3  a  module  Q  with  a  call  to  P,  P(...,g,.. .)  such  that 
the  formal  parameter  of  P  corresponding  to  g  is  used  within  P. 

Synopsis:  A  module-global  indirect  access  pair  (M,g)  represents  the  fact  that 
global  variable  g  is  indirectly  used  by  module  M  when  there  is  a  call  to  M  with 
formal  parameter  g  in  module  N  which  uses  g. 

Example  2  Module-global  indirect  access  pair  (M,g) 

N()  M(a) 

{  { 

M(g)  //  a  is  used  within  M 

}  > 

Definition  10  Let  P,  Q  be  two  modules  and  g  be  a  global  variable  in  P  and  Q.  There 
exists  a  module-global  access  data  binding  triple  (P,g,Q)  if  g  is  defined  within  P 
and  referenced  within  Q. 

Synopsis:   A  module-global  access  data  binding  triple  (P,g,Q)  represents  the 

fact  that  module  P  defines  the  value  of  global  variable  g  and  module  Q  references 

the  value  of  global  variable  g.    It  reflects  the  data  relationship  between  modules  P 

and  Q,  and  the  direction  of  the  information  flow  (from  P  to  Q). 

Example  3  Module-global  access  data  binding  triple  (P,g,Q) 

PC)  QO 

{  { 

g<-  //  is  defined    <-g  //  is  referenced 
}  > 
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Definition  11  Let  P,   Q  be  two  modules  and  x  be  a  local  variable  in  P.    There  is  a 
local-export  data  binding    triple  (P,  x,  Q)  if 

1.  x  is  defined  within  P  before  a  call  Q(. . . ,  x, . . .)  in  P ,  and 

2.  the  formal  parameter  of  Q  corresponding  to  x  is  referenced  within  Q. 

Synopsis:  A  local-export  data  binding  triple  (P,  x,  Q)  represents  the  fact  that 
module  P  defines  the  value  of  local  variable  x,  and  the  corresponding  formal  pa- 
rameter is  referenced  within  module  Q.  It  is  based  on  the  binding  between  the  local 
variable  and  the  corresponding  formal  parameter  and  the  direction  of  the  information 

flow  (from  P  to  Q). 

Example  4  Local-export  data  binding  triple  (P,x,Q) 

P()  Q(a: [by  reference,by  value-result,   by  value]) 

{  x: local  { 

x<-   //   is   defined  <-a  //   is   referenced 

Q(x) 
}  > 

Definition  12  Let  P,   Q  be  two  modules  and  x  be  a  local  variable  in  P.    There  is  a 
local-import  data  binding    triple  (Q,x,P)  if 

1.3  a  call  Q(. . . ,  x, . . .)  in  P  such  that  the  formal  parameter  of  Q  corresponding 
to  x  is  a  call-by-reference  or  call-by-value-result  parameter  and  is  defined  within 
Q,  and 

2.  x  is  referenced  within  P  after  the  call. 

Synopsis:  A  local-import  data  binding  triple  (Q,x,P)  represents  the  fact  that 
module  Q  defines  the  value  of  the  formal  parameter  a,  and  the  corresponding  actual 
parameter  in  a  call  to  Q  within  P,  is  referenced  within  P  after  the  call.  It  reflects 
the  data  relationship  between  modules  P  and  Q  due  to  the  local  variable-formal 
parameter  bindings,  and  the  direction  of  the  information  flow  (from  Q  to  P). 
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Example  5  Local-import  data  binding  triple  (Q,x,P) 

P()  Q(a: [by  reference.by  value-result]) 

{  x: local  { 

Q(x)  a<-   //   is  defined 

<-x  //  is  referenced 
}  > 

The  data  bindings  due  to  return  value  relationships  are  handled  similar  to 
local-import  data  bindings.  First,  one  of  several  transformations  are  performed  on 
the  function  invocation  and  invoked  function  definition  as  follows: 

•  Invoked  function  definition  transformations: 


C  CODE 
Q(b,c) 
{ 

return  a; 


TRANSFORMATION 

Q(a,b,c) 

{ 

a<-  //  is  defined 


} 
C  CODE 

q(b,c) 

{ 

return  val; 


TRANSFORMATION 

q(<retval>,b,c) 

{ 

<retval>  <-  val  //  is  defined 


•  Function  invocation  transformations: 


1. 


2. 


C  CODE 

PC.) 

{x; 

TRANSFORMATION 

PC.) 
{x; 

x  =  q(y,z); 

<-  x  //  is  used 

q(x,y,z); 

<-  x  //  is  used 

}'*' 

>'" 

C  code 

PC) 

{ 

transformation 

PC.) 

{x; 

exp(q(y,z)) ; 

q(x,y,z); 
exp(x) ; 
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Definition  13  Let  P,  Q  be  two  modules  and  P  uses  the  return  value  from  Q  after  a 
call  Q(...)  in  P.  Perform  one  of  two  kind  of  transformations  on  the  invoked  module 
definition  and  the  module  invocation.  Let  a  or  <  retval  >  be  the  formal  parameter 
resulting  from  the  transformation  on  the  invoked  module  definition.  There  is  a  return- 
value  data  binding    triple  (Q,  a,  P)  or  (Q,  <  retval  >,P)  if 

1.  3  a  call  Q(x)  in  P  such  that  x  is  a  transformation-generated  local  variable 
in  P  corresponding  to  the  transformation-generated  formal  parameter  a  or  < 
retval  >  in  Q  that  is  defined  within  Q,  and 

2.  x  is  referenced  within  P  after  the  call. 

Synopsis:  A  return-value  data  binding  triple  (Q,a,P)  or  (Q,  <  retval  >,P) 
represents  the  fact  that  module  Q  defines  a  return-value,  returns  it  to  module  P,  and 
this  return  value  is  referenced  within  module  P  after  the  call.  Either,  the  return- 
value  is  saved  in  a  variable  after  the  call  for  later  reference,  or  it  is  directly  referenced 
after  the  call.  Notice  that  the  data  binding  is  expressed  in  terms  of  the  return  value 
variable  as  opposed  to  the  local  variable  in  the  invoking  module. 

Definition  14  Let  P,  Q  be  two  modules  and  g  be  a  global  variable  in  P.  There  is  a 
global-export  data  binding    triple  (P,g,Q)  if 

1.  g  is  defined  within  P,  before  a  call  Q(. . .  ,g, . . .)  in  P ,  and 

2.  the  formal  parameter  of  Q  corresponding  to  g  is  referenced  within  Q. 

Synopsis:  A  global-export  data  binding  triple  (P,g:  Q)  is  symmetric  to  the  local- 
export  data  binding  triple  except  that  in  the  former  the  local  variable  x  is  replaced 
by  a  global  variable  g. 
Example  6  Global-export  data  binding  triple  (P,g,Q) 
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P()  Q(a: [by  reference, by  value-result,  by  value]) 

{  { 

g<-  //  is  defined    <-a  //  is  referenced 

Q(g) 
}  > 

Definition  15  Let  P,   Q  be  two  modules  and  g  be  a  global  variable  in  P.    There  is  a 
global-import  data  binding    triple  (Q,g,P)  if 

1.3  a  call  Q(. . .  ,g, . . .)  in  P  such  that  the  formal  parameter  of  Q  corresponding 
to  g  is  a  call-by-reference  or  call-by-value-result  parameter  and  is  defined  within 
Q,  and 

2.  g  is  referenced  within  P  after  the  call. 

Synopsis:    A  global-import  data  binding  triple  (Q,g,P)  is  symmetric  to  the 
local-import  data  binding  triple  except  that  in  the  former  the  local  variable  x  is 
replaced  by  a  global  variable  g. 
Example  7  Global-import  data  binding  triple  (Q,g,P) 

P()  Q(a:[by  reference, by  value-result]) 

{  { 

Q(g)  a<-   //   is  defined 

<-g  //  is  referenced 
}  } 

Whenever  it  is  not  confusing,  we  will  use  the  term  data  binding  interchangeably 
with  any  of  the  forms  of  data  bindings  above. 

Definition  16  Let  Type(v)  be  the  type  of  variable  v.  Type  size  of  variable  v,  denoted 
Tsize(v),  is  the  amount  of  information  carried  by  type  Type(v)  of  variable  v. 

Tsize(v)  quantifies  the  information  level  associated  with  variable  v  due  to  the 
its  type.  In  a  conventional  programming  language,  such  as  C  or  Ada,  a  primitive  type 
is  the  least  difficult  to  understand  of  its  types.  Since  a  pointer  represents  the  address 
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Figure  5.1.  Schematic  illustrations  of  access  pairs  and  data  bindings 
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Table  5.1.  Type  size  associated  with  several  types. 

TYPE  (Based  on  "C"/  "Ada")  TSIZE 

primitive  types 

e.g.:  int,  char,  float,     1 
void,boolean,enum       1 
pointer  1 

array  /(Tsize(array[i]), 

No.  of  elements, 
Control  Info  Size) 
user-defined  types;  e.g:struct,  union  Sum  of  the  Tsizes  of  elements 


of  an  object,  we  conclude  that  it  is  as  easy  to  understand  as  primitive  types.  The 
user-defined  types,  such  as  structures  and  unions,  are  more  complicated  because  their 
components'  information  level;  thus,  we  assume  that  the  difficulty  of  understanding 
a  user-defined  type  is  the  added  difficulty  of  understanding  each  of  its  components. 
Finally,  an  array  has  special  mechanisms  to  manipulate  its  elements  which  adds  to 
the  difficulty  of  understanding  the  array. 

Table  5.1  specifies  the  type  size  for  the  type  found  in  two  conventional  program- 
ming languages.  The  higher  the  Tsize  value,  the  more  information  is  carried  by  the 
type  and  it  is  more  difficult  to  understand  a  variable  of  such  type.  For  arrays,  control 
info  size  refers  to  the  complexity  associated  with  the  array  control  mechanisms  of 
the  particular  language  under  analysis  (e.g.:  in  C,  type  control  size  =  2:  1  unit  to 
account  for  the  index  -  offset  -  used  in  addressing  the  individual  elements  of  the  array 
and  1  unit  to  account  for  the  storage  of  the  address  of  the  beginning  of  the  array). 

Definition  17  A  group  G  is  a  pair  of  sets  of  modules  and  global  variables. 

Definition  18  The  boundary  of  group  G  is  the  set  of  data  bindings  (Mi,v,Mj)  such 
that  either  Mi  e  G  and  Mj  <£  G,  or  Mt  g-  G  and  Mj  G  G. 
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In  the  case  of  a  module-global  access  data  binding  (Mi,g,  Mj)  in  the  boundary 
of  group  G,  there  are  two  associated  access  pairs  in  the  boundary,  namely  (M{,g) 
and  (Mj,g). 

Definition  19  The  interior  set  of  group  G  is  the  set  of  data  bindings  (Mi,v,Mj)  such 
that  both  Mi  e  G  and  Mj  €  G. 

Let  \S\  be  the  number  of  elements  in  set  5". 

5.3.2     Inter-group  Complexity  Factors 

The  inter-group  complexity  of  group  G  measures  the  complexity  in  the  interface 
between  the  group  G  and  other  groups  of  the  system.  The  inter-group  complexity  of 
group  G  is  characterized  by  the  union  of  all  the  primitive  measurements  of  inter-group 
complexity.  The  total  inter-group  complexity  of  a  system  is  the  average  inter-group 
complexity  of  the  groups  in  the  system.  Another  interesting  measure  is  to  consider 
the  maximum  inter-group  complexity  of  all  the  groups  in  the  system. 

5.3.2.1      Data.-ba.sed  primitive  measurements 

In  this  section,  we  present  the  primitive  measurements  of  the  inter-group  com- 
plexity which  are  based  on  the  data  of  the  system. 

•  fi(G)  is  the  set  of  global  variables  used  by  modules  inside  group  G  and  also 
used  by  modules  outside  group  G,  or  the  set  of  direct  interface  variables,  i.e.: 

fi(G)  =  {  g  \3  module-global    access    pairs  (M,g)  and  {N,g)  such  that  M  £ 
G  and  N  <£  G.} 

Assume  that  a  global  variable  which  is  concurrently  used  by  modules  in  two 
groups,  increases  the  inter-group  complexity  of  both  groups. 
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We  expect  that  the  more  global  variables  are  shared  between  a  group  and  other 
groups,  the  more  this  group  is  related  to  these  other  groups.  The  relationship 
is  due  to  the  potential  of  sharing  the  same  data  space  among  several  groups. 

Beladi's  intercluster  complexity  [3]  is  proportional  to  N,  the  total  number  of 
nodes  (where  nodes  are  either  modules  or  global  variables)  in  a  system.  This 
complexity  is  given  by  C0  —  N  *  E0  where  E0  is  the  number  of  "intercluster" 
edges.  Thus,  given  the  number  of  global  variables  JV,  we  conclude  that  the 
number  of  global  variables  increases  the  inter-group  complexity. 

•  /2(G)  is  the  set  of  global  variables  outside  of  group  G  indirectly  used  by  modules 
within  group  G,  or  the  set  of  indirect  interface  variables,  i.e.: 

/2(G)  —  {  g  I  3  module-global  indirect  access  pair  (M,g)such  that  M  £ 
G  and  g  $•  G.} 

We  expect  that  the  more  global  variables  which  indirectly  affect  or  are  affected 
by  modules  in  the  group,  the  probability  that  the  group  is  indirectly  related 
to  other  groups  increases.  This  relationship  is  due  to  the  indirect  knowledge 
about  a  global  that  the  module  must  have. 

In  this  factor,  Beladi's  intercluster  complexity  is  also  used  to  verify  our  intuition. 
Since  the  intercluster  complexity  is  proportional  to  N,  the  total  number  of 
nodes  in  a  system,  we  conclude  that  the  number  of  global  variables  increases 
the  inter-group  complexity. 

•  /3(G)  is  the  boundary  set  of  group  G,  i.e.: 

/3(G?)  =  {(M{,  v,  Mj)  I  3  a  data  binding  (Mi,  v,  Mj)  in  the  boundary  set  of  G.} 

The  intercluster  complexity  is  proportional  to  the  number  of  "intercluster  edges" 
[3].  Accordingly,  the  more  intercluster  edges  there  is,  the  more  complex  is  the 
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interface  of  the  system.  In  this  factor,  the  data  bindings  across  groups  bound- 
aries correspond  to  the  intercluster  edges.  Hence,  the  complexity  increases  with 
the  number  of  data  bindings.  A  similar  observation  was  made  by  Henry  and 
Kafura  [11]  in  the  case  of  modularity  metrics. 

•  /4(G)  is  the  set  of  different  variables  transferring  information  between  group  G 
and  other  groups,  or  the  set  of  variables  in  the  boundary  set,  i.e.: 

f4(G)  =  {v  I  3  a  data  binding  (Mi,v,Mj)  in  the  boundary  set  of  G.} 

Assume  that  variables  which  transfer  information  between  the  group  and  other 
groups,  thus  defining  data  bindings,  relate  the  group  to  those  other  groups. 
Hence,  the  more  variables  relate  a  group  to  other  groups,  the  inter-group  com- 
plexity increases.  Similarly,  Beladi's  intercluster  complexity  considers  unique 
nodes,  where  a  node  is  either  a  module  or  a  global  data  in  the  system,  as  a  factor 
which  increases  intercluster  complexity.  Factor  f4  considers  unique  occurrence 
of  variables  in  the  boundary  set. 

5.3.2.2     Type-based  primitive  measurements 

In  this  section,  we  present  the  primitive  measurements  of  the  inter-group  com- 
plexity which  are  based  on  the  types  of  data  in  the  system. 

Assume  that  the  information  level  of  a  variable  is  determined  by  its  type  size 
(Tsize)  according  to  Definition  16. 

•  /5(G)  =  1S  tne  sum  of  type  sizes  of  the  global  variables  used  by  group  G  and  also 
used  by  modules  outside  group  G,  or  the  sum  of  type  sizes  of  direct  interface 
variables,  i.e.: 

f5(G)  =  £  Tsize(g) 

{g  I  3  module  —global  access  pairs  (M,g)  and  (N,g) 
9£  such  that  M  EG  and  N  £  G.} 
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We  expect  that  the  higher  total  type  size  of  all  global  variables  concurrently 
used  by  modules  inside  and  outside  the  group,  the  probability  that  the  group  is 
related  to  those  other  groups  increases  and  the  inter-group  complexity  increases. 

•  fe,(G)  =  is  the  sum  of  type  sizes  of  global  variables  outside  group  G  indirectly 
used  by  modules  within  group  G,  or  the  sum  of  type  sizes  of  indirect  interface 
variables,  i.e.: 

/•(G)  =  E  Tsize(g) 

{g  |  3  module  —  global  indirect  access  pair  (M,  g) 
such  that  M  £  G  and   g  £  G.} 

We  expect  that  the  higher  total  type  size  of  the  global  variables  outside  the 
group  indirectly  used  by  modules  inside  the  group,  the  probability  that  the 
group  is  related  to  those  other  groups  increases  and  the  inter-group  complexity 
increases. 

•  Consider  the  set  of  data  binding  triples  in  the  boundary  set  of  a  group.  /7(G) 
is  the  sum  of  type  sizes  of  boundary  set  variables. 

The  sum  of  the  type  sizes  of  different  variables  passing  data  into  group  G  is: 

/7.1(G)  =  £  Tsize(v) 

v£{v  I  3  data  binding  (Mi,v,Mj)£  boundary  set  of  G  B  MigG  and  MjGG.} 

The  sum  of  the  type  sizes  of  different  variables  passing  data  out  of  group  G  is: 
/7.2(G)  =  J2  Tsize{v) 

v£{v  I  3  data  binding  (Mi,V,Mj)S  boundary  set  of  G  9M;   £  G  and  Mj&G.} 

Assume  that  variables  which  transfer  information  between  the  group  and  other 
groups,  defining  data  bindings,  contribute  to  the  inter-group  complexity  of  the 
group  according  to  the  variable's  type  size.  We  expect  that  the  higher  total 
type  size  of  the  variables  passing  data  to  or  from  the  group,  the  inter-group 
complexity  increases. 
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•  fs(G)  is  the  set  of  different  types  of  variables  passing  data  between  group  G 
and  other  groups,  or  the  set  of  types  of  boundary  set  variables,  i.e.: 

fs(G)  =  {Type(v)  \  3  data  binding(M{,v1  Mj)  £    boundary  set  of  G.} 

Assume  that  each  type  of  variable  passing  data  to  or  from  the  group  needs 
to  be  supported  by  the  group's  interface.  Each  type  increases  the  inter-group 
complexity.  Consequently,  the  more  different  types  of  variables  are  supported, 
the  inter-group  complexity  increases.  That  is  due  to  the  different  behaviors 
which  need  to  be  supported  by  the  group's  interface. 

5.3.3     Intra-group  Complexity  Factors 

The  intra-group  complexity  of  group  G  consists  of  the  complexity  of  group  G  in 
a  system  partition.  The  intra-group  complexity  of  group  G  is  characterized  by  the 
union  of  all  primitive  measurements  of  intra-group  complexity.  The  total  intra-group 
complexity  of  a  system  is  the  sum  of  the  intra-group  complexity  of  all  the  system's 
groups. 

Another  measure  related  to  these  factors  is  the  intra-group  strength1  which  is  a 
measure  of  the  relationship  between  the  modules  in  a  group. 

5.3.3.1     Data-based  primitive  measurements 

In  this  section,  we  present  the  primitive  measurements  of  the  intra-group  com- 
plexity which  are  based  on  the  data  of  the  system. 

•  /9(G)  is  the  set  of  global  variables  used  exclusively  by  modules  in  group  G,  or 
the  set  of  direct  internal  variables,  i.e.: 

/9(G)  =  {  g  I  V  module-global    access    pair  (Mi,g)  3    Mi  £  G.} 


1This  term  was  originally  used  by  Myers  [27]. 
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•  fw{G)  is  the  set  of  global  variables  used  indirectly  only  by  modules  in  group 
G,  or  the  set  of  indirect  internal  variables,  i.e.: 

/10(G)  —  {  g  I  V  module-global   indirect    access    pair  (Mi,g)  3    Mi  G  G.} 

Factors  fg  and  /Xo  above  consists  of  the  global  variables  "exclusively  used"  by 
modules  in  the  group. 

Assume  that  a  global  variable  exclusively  used  by  modules  in  the  group  weakens 
the  intra-group  strength  and  increases  the  intra-group  complexity.  According 
to  [3],  a  global  variable  negatively  affects  the  intracluster  complexity  by  increas- 
ing its  complexity  value,  given  a  set  of  edges.  Consequently,  a  global  variable 
increases  the  intra-group  complexity.  Also,  we  argue  that  a  global  variable  neg- 
atively affects  the  intra-group  strength  since  other  groups  may  use  this  global 
variable  thus  reducing  the  functional  relatedness  of  the  group. 

We  expect  that  the  more  global  variables  (exclusively  used  by  modules  in  the 
group  and  thus  not  used  by  modules  in  other  groups),  the  intra-group  com- 
plexity increases.  An  extreme  case  occurs  when  the  global  variable  becomes  a 
local  variable  with  respect  to  the  group.  Similarly,  we  expect  that  the  strength 
between  modules  in  the  group  weakens. 

•  Consider  the  set  of  data  bindings  triples  in  the  interior  set  of  a  group.  /11(G) 
is  the  interior  set  of  group  G,  i.e.: 

/11.1(G)  is  the  set  of  module-global  access  data  bindings  involving  modules 
within  group  G,  i.e.: 

/11.1(G)  =  {(Mi,v,Mj)  I  3  a  module-global    access    data   binding  (Mi,v,Mj) 
in  the  interior  set  of  G.} 
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/11.2(G)  is  the  set  of  local-export  data  bindings  involving  modules  within  group 
G,  i.e.: 

/11.2(G)  =  {(Mj-,u,Mj)  I  3  a  local-export    data   binding  (Mi,v,Mj)  in  the 
interior  set  of  G.} 

/11.3(G)  is  the  set  of  local-import  data  bindings  involving  modules  within  group 
G,  i.e.: 

/11.3(G)  =  {(Mi,v,M/)  I  3  a  local-import    data   binding  (M,-,u,Mj)  in  the 
interior  set  of  G.} 

/11.4(G)  is  the  set  of  global- export  data  bindings  involving  modules  within  group 
G,  i.e.: 

/11.4(G)  —  {(Mi, v,  ftij)  I  3  a  global-export    data   binding  (Mi,v,Mj)  in  the 
interior  set  of  G.} 

/11.5(G)  is  the  set  of  global-import  data  bindings  involving  modules  within 
group  G,  i.e.: 

/11.5(G)  =  {(Mi,v,Mj)  I  3  a  global-import    data   binding  (M,-,u,  Af,-)  in  the 
interior  set  of  G.} 

/11.6(G)  is  the  set  of  return-value  data  bindings  involving  modules  within  group 
G,  i.e.: 

/11.6(G)  =  {(Mi,v,Mj)  I  3  a  return-value   data   binding  (Mi,v,Mj)  in  the 
interior  set  of  G.} 

According  to  Beladi  and  Evangelisti  [3],  the  intracluster  complexity  of  a  single 
cluster  j  is  Cj  —  tlj  *  6j  where  nj  is  the  number  of  nodes  in  the  cluster  and  ej  the 
number  of  pairwise  edges,  that  is,  "connections  between  the  same  elements." 
Given  that  edges  correspond  to  data  bindings  in  our  model,  we  conclude  that 
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the  more  data  bindings  between  the  modules  in  a  group,  the  higher  the  intra- 
group  complexity  of  the  group. 

Furthermore,  the  intra-group  strength  increases  with  the  number  of  data  bind- 
ings since  the  more  "channels"  of  data  passing  between  modules  in  the  group, 
the  higher  the  functional  relatedness  of  the  group. 

It  appears  as  if  factors  /n  conflict  with  factors  /9  and  /10;  however,  data  bind- 
ings define  relations  between  modules  which  theoretically  speaking  will  increase 
the  strength.  However,  the  fact  that  the  data  bindings  are  sometimes  due  to 
global  variables  involves  a  risk  in  that  other  groups  in  the  system  may  access 
these  global  variables  which  weakens  the  strength  of  the  group. 

•  This  factor  applies  to  groups  derived  from  using  the  globals-based  analysis 
(Algorithm  1):  /12(G)  is  the  percentage  of  modules  in  group  G  which  directly 
use  all  global  variables  in  group  G,  i.e.: 

j 12(G)  —  \{M  I  3  module-global  access  pair  {M,gi)  V  global  variable  gx  G 
G  andM  G  G.}\/\{M  |  M  G  G}\ 

Assume  that  a  global  variable  used  by  all  modules  in  the  group  increases  the 
intra-group  strength.  The  higher  the  ratio  of  the  number  of  modules  which 
directly  use  all  global  variables  in  the  group  to  the  total  number  of  modules, 
the  intra-group  strength  increases. 

5.3.3.2     Type-based  primitive  measurements 

In  this  section,  we  present  the  primitive  measurements  of  the  intra-group  com- 
plexity which  are  based  on  the  types  of  data  in  the  system. 

Assume  that  the  information  level  of  a  variable  is  determined  by  its  type  size 
(Tsize)  according  to  Definition  16. 
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•  Consider  the  set  of  data  bindings  triples  in  the  interior  set  of  group  G. 

Also,  consider  only  the  type  of  different  variables  since  we  measure  the  com- 
plexity due  to  the  kind  of  information  passed  between  modules  of  the  group,  as 
opposed  to  the  volume  of  this  information. 

/13(G)  is  the  sum  of  the  type  sizes  of  different  variables  passing  data  between 
modules  Mi  and  Mj  in  group  G: 

/13(G)  =  £  Tsize{v) 

ve  {v  I  3  data  binding  (Mi,v,Mf)  €    interior  set  of  G.} 

We  expect  that  the  higher  total  type  size  of  all  the  variables  used  by  modules 
in  the  group,  the  intra-group  complexity  increases  and  the  intra-group  strength 
decreases. 

•  /14(G)  is  the  set  of  types  of  different  variables  passing  data  between  modules 
Mi  and  Mj  in  group  G,  or  the  set  of  types  of  interior  set  variables,  i.e.: 

/14(G)  =    {Type(v)  I  3  data  binding  (M,-,u,  Mj)  6    the  interior  set  of  G.} 

Assume  that  each  type  of  variable  passing  data  among  the  modules  in  the  group 
needs  to  be  supported  by  the  module.  Each  type  increases  the  intra-group 
complexity.  Therefore,  the  more  different  types  of  variables  are  supported,  the 
intra-group  complexity  increases.  Also,  less  commonality  is  observed  in  the 
group  and  the  intra-group  strength  decreases  due  to  the  reduced  functional 
relatedness  of  the  group. 

•  The  following  factors  only  apply  to  groups  derived  from  using  the  types-based 
analysis  (Algorithm  2): 
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Definition  20  The  Base  Type  of  group  G,  denoted  Btype(G),  is  the  type  used 
as  the  grouping  criteria  during  the  "grouping"  step  of  the  Types-based  Object 
Finder,  Algorithm  2. 

Definition  21  Module  M  manipulates  type  t,  denoted  by  the  manipulation  pair 
[M,  t],  if  t  is  the  type  of  a  formal  parameter  of '  M ,  or  t  is  the  type  of  the  return 
value  of  M,  or  t  =  Type(g)  such  that  3  access  pair  (M,g).  This  is,  t  is  one 
of  the  types  that  module  M  may  manipulate. 

Definition  22  A  grouping  manipulation  pair  in  group  G  is  a  manipulation  pair 
[M,t]  such  that  M  G  G  and  t  =  Btype(G). 

There  exists  a  grouping  manipulation  pair  for  a  module  M,  M  €  G,  and  type  t, 
denoted  [M,  t],  iff  any  of  module  M  types  of  formal  parameters  or  return  value, 
i,  is  equivalent2  to  the  base  type  of  group  G,  Btype(G);  i.e.:  t  =  Btype(G). 
Also,  there  exists  a  grouping  manipulation  pair  for  a  module  M,  M  £  G,  and 
type  of  variable  v,  Type(v),  denoted  [M, Type(v)],  iff  there  exists  a  module- 
global  access  pair  (M,  v)  and  the  type  of  variable  v,  Type(v),  is  equivalent  to 
the  base  type  of  group  G;  i.e.:  Type(v)  =  Btype(G). 

/i5(G)  is  the  set  of  grouping  manipulation  pairs  [M,t]  such  that  M  €  G,  i.e.: 

/15(G)  =  {[Af,  t]  is  a  grouping  manipulation  pair  \  M  £  G.} 

We  expect  that  the  more  grouping  manipulation  pairs,  the  intra-group  strength 
increases.  The  evidence  indicates  that  this  factor  does  not  affect  the  intra-group 
complexity. 


2Equivalent  types  are  defined  in  Algorithm  2  in  Chapter  3. 
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•  Definition  23  A  degrouping  manipulation  pair  in  group  G  is  a  manipulation 
pair  [M,t]  such  that  M  eG  and  t  <  Btype(G). 

There  exists  a  degrouping  manipulation  pair  for  a  module  M,  M  E  G,  and  type 
t,  denoted  [M,  t],  iff  any  of  module  M  types  t  of  formal  parameters  or  return 
value  is  a  "part  of"  the  base  type  of  group  G  or  "contains"  type  that  is  the 
base  type  of  group  G,  Btype(G);  i.e.:  t  <<  Btype(G)  or  i  >  Btype(G).  Also, 
there  exists  a  degrouping  manipulation  pair  for  a  module  M,  M  E  G,  and  type 
of  variable  v,  Type{v),  denoted  [M,  Type(v)],  iff  there  exists  a  module-global 
access  pair  (M,v)  and  the  type  of  variable  v,  Type(v),  is  a  "part  of"  the  base 
type  of  group  G  or  "contains"  type  that  is  the  base  type  of  group  G,  Btype(G); 
i.e.:  Type(v)  <C  Btype(G)  or  Type(v)  ^>  Btype(G). 

fie(G)  is  the  set  of  degrouping  manipulation  pairs  [M,  t]  such  that  M  E  G,  i.e.: 

fie(G)  —  {[M,  t]  is  a  degrouping  manipulation  pair  \  M  E  G.} 

We  expect  that  the  more  degrouping  manipulation  pairs,  the  intra-group  strength 
decreases.  Once  more,  this  factor  does  not  affect  the  intra-group  complexity. 

•  /17(G)  =  the  ratio  of  number  of  grouping  manipulation  pairs  to  the  number  of 
degrouping  manipulation  pairs. 

fxr(G)  =  \MG)\  :  |/16(G)| 

Given  factors  /15  and  /i6  above,  we  expect  that  the  higher  the  ratio  of  grouping 
manipulation  pairs  to  degrouping  manipulation  pairs,  the  intra-group  strength 
increases. 

In  the  case  that  types  t  and  Btype(G)  are  not  explicitly  related  (e.g.,  types  are 
equivalent  or  there  is  a  "part  of"  or  "contains"  relationship  between  the  types), 
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Figure  5.2.  Example  of  the  primitive  complexity  metrics  factors 

the  corresponding  manipulation  pairs  of  the  form  [M,  t]  are  not  considered  in 
factors  /15,  /16,  and  fu. 

5.3.4     Example  of  the  Factors 

This  section  presents  an  example  of  the  primitive  complexity  metrics  factors. 
Figure  5.2  consists  of  a  set  of  several  routines,  global  variables,  and  a  group. 

Examples  of  the  complexity  metrics  factors,  in  Figure  5.2,  related  to  group  G 
are: 


1.   Inter-group  Complexity 

•  f3  Boundary  set: 

f3{G)  =  {(Muvl:M2),(M3,v3,M5),(M3,v3,Mr),(M4,v4,M8), 
(M5,  u5,  Ms),  (M10,  v5,  Mn),  (Mio,  v5,  Mn)} 
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•  Ia  Variables  in  boundary  set:  /4(G)  =  {^1,^3,^4,^5} 

•  fi  Sum  of  type  sizes  of  different  variables  in  boundary  set:    /7(G)    = 

Tsize(v\)  +  Tsize(v3)  +  Tsize(v4)  +  Tsize(v$) 

2.   Intra-group  Complexity 

•  /n  Interior  set: 

ftl(G)  =  {(M2,U2,M6),(M2,W3,M7),(M2,U3,M5),(M5,U5,M11)} 

•  /13  Sum  of  type  sizes  of  different  variables  in  interior  set:    /13(G)    = 
Tsize(v2)  +  Tsize(v%)  +  Tsize(vs) 

5.3.5     Validation  of  the  Factors 

This  section  presents  the  validation  of  the  complexity  metrics  factors  of  the 
previous  sections.  The  validation  approach  consists  of  proving  that  the  validation 
hypothesis  holds.  This  approach  is  described  in  Figure  5.5.  This  section  also  illus- 
trates the  sensitivity  of  the  factors  to  a  particular  modification  of  a  program.  More 
experimentation  is  needed  to  further  validate  these  factors  with  respects  to  other  kind 
of  changes  applied  to  programs.  Other  interesting  modifications  to  programs  include 
a  functionally  equivalent  program  with  several  functions  merged  into  a  single  func- 
tion or  several  abstract  data  types  of  the  original  program  replaced  with  equivalent 
data  structures. 

The  example  program  consists  of  a  simple  recursive  descent  expression  parser 
implemented  in  "C".  First,  we  present  the  complexity  metrics  computed  on  a  ver- 
sion of  this  program  which  consists  of  two  source  files,  eighteen  functions,  and  four 
global  variables.  Second,  we  present  the  metrics  computed  on  another  version  of  this 
program  with  the  same  functionality  except  that  most  function  parameters  are  con- 
verted into  global  variables;  this  version  consists  of  the  same  number  of  source  files  as 
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Table  5.2.  Type  size  associated  with  variables  of  different  types  in  the  original  version 
of  the  recursive  descent  expression  parser. 


Variable 

Type 

TSIZE 

prog 

char* 

1 

tok_type 

char 

1 

token 

char [80] 

1  *  80  +  2  = 

=  82 

vars 

float [26] 

1  *  26  +  2  = 

=  28 

"answer" 

float 

1 

"op" 

register  char 

1 

"result" 

float* 

1 

~c~ 

char 

1 

"hold- 

float 

1 

the  previous  version  but  with  a  total  of  nineteen  functions  and  six  global  variables. 
Third,  we  explain  the  effect  that  the  structure  of  each  version  of  the  program  had  on 
the  primitive  complexity  metrics  factors  and  illustrate  the  sensitivity  of  the  metrics 
to  this  class  of  changes  in  the  structure  of  a  program. 

In  the  first  version  of  the  program,  named  the  original  version,  the  top-down 
analysis  approach  was  used  with  both  globals-based  and  types-based  analyses;  we 
choose  to  ignore  C  primitive  types  during  the  types-based  analysis.  The  identi- 
fied objects  are  shown  in  Figure  5.3.  Some  modifications  of  the  candidate  objects 
were  performed  to  obtain  completely  disjoint  candidate  objects  in  terms  of  their 
routines.  These  modifications  consisted  of  removing  common  components  (routines) 
from  object  Gtok_type#2  to  a  single  candidate  object.  The  resulting  objects  after 
the  modifications  are  shown  in  Figure  5.3. 

Table  5.2  illustrates  the  types  sizes  corresponding  to  the  types  of  variables  in 
the  original  version  of  the  example.  The  primitive  metrics  factors  were  computed  for 
inter-group  complexity,  intra-group  complexity,  and  intra-group  strength. 
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Obj 


} 

Obj 


ect  z#objTchar*#3  is  { 
~~char*~76 

Float_Parsing~ II~iswhite~181 

Float_Parsing~ II~isdelim~175 

Float_Parsing~ II~is_in~172 


} 
Obj 


ect  z#objTfloat*#4  i 
~~float*~24 

Float_Parsing~ II 

Float .Parsing" II 

Float .Parsing" II 

Float_Parsing~ II 

Float_Parsing~ II 

Float_Parsing" II 

Float_Parsing~ II 

Float _Parsing~ II 


{ 

primitive~240 

level6~198 

level5~196 

level4~194 

level3~192 

level2~190 

levell~188 

get_exp~143 


} 
Obj 


ect  z#objT+undetermined-type#5  is  { 

~~float*~24 

~~char*~76 

Float _Par sing" II "unary "371 

Float_Parsing~ II~f ind_var~100 

Float_Parsing~ II~arith"21 

ect  z#objGtok_type#2  is  { 

main" II~prog~243 

Float _Par sing" II~vars~374 

Float _Parsing~ II~token~357 

Float_Parsing~ II~tok_type~356 

main" I I "main" 206 

Float_Parsing~ II~putback~244 

Float_Parsing~ II~get_token~145 

Float_Parsing~ II~f ind_var~100 


(T)char* 
(R)iswhite 
(R)isdelim 
(R)is_in 


(T)float* 

(R)primitive 

(R)level6 

(R)level5 

(R)level4 

(R)level3 

(R)level2 

(R) level 1 

(R)get_exp 


(T)float* 
(T)char* 
(R) unary 
(R)f ind_var 
(R)arith 


(G)prog 
(G)vars 
(G) token 
(G)tok_type 
(R)main 
(R)putback 
(R)get_token 
(R)f ind_var 


Figure  5.3.  Identified  objects  in  original  version  of  recursive  descent  expression  parser 
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Table  5.3.   Type  size  associated  with  variables  of  different  types  in  version  1  of  the 
recursive  descent  expression  parser. 


Variable 

Type 

TSIZE 

prog 

char  * 

1 

tok_type 

char 

1 

token 

char [80] 

1  *  80  +  2  =  82 

vars 

float [26] 

1  *  26  +  2  =  28 

answer 

float [1048] 

1  *  1048  +  2  =  1050 

result 

float  * 

1 

~OP~ 

register  char 

1 

~c~ 

char 

1 

"hold" 

float 

1 

For  the  other  version  of  the  program,  named  version  1,  the  top-down  analysis 
approach  was  used  with  both  globals-based  and  types-based  analyses;  we  choose  to 
ignore  C  primitive  types  during  the  types-based  analysis.  The  identified  objects  are 
shown  in  Figure  5.4.  Some  modifications  of  the  candidate  objects  were  performed 
to  obtain  completely  disjoint  candidate  objects,  specifically  routine  f  ind_var  was 
removed  from  object  T+undetermined-type#4.  The  resulting  objects  after  the  mod- 
ifications are  shown  in  Figure  5.4. 

Table  5.3  illustrates  the  types  sizes  used  for  the  metrics  computation.  In  this 
version  of  the  program,  as  well  as  the  rest  of  the  examples  in  this  thesis,  whenever 
possible,  the  name  of  an  identifier  denotes  the  identifier,  instead  of  using  the  unique 
ID  which  is  automatically  generated  by  the  object  finder  tool  of  Chapter  7.  See 
Section  8.1  for  a  complete  explanation  of  unique  ID  in  the  internal  representation  of 
the  program  used  by  the  object  finder  tool. 

The  primitive  metrics  factors  were  computed  for  inter-group  complexity,  intra- 
group  complexity,  and  intra-group  strength. 
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Obj 


} 
Obj 


} 

Obj 


ect  z#objTchar*#3  is  { 
~~char*~77 

Float _Par sing" II~iswhite~177 

Float_Parsing~ II~isdelim~171 

Float.Parsing" II" is. in" 168 

ect  z#objT+undetermined-type#4  is  { 

~~float*~25 

~~char*~77 

Float .Parsing" II~unary~362 

Float.Parsing" II~arith~22 


ect  z#objGanswer#2 

main" II~prog~232 

Float_Parsing~ II 

Float _Parsing~ II 

Float _Par sing" II 

Float .Parsing" II 

Float .Parsing" II 

main" II~main~196 

Float .Parsing" II 

Float .Parsing" II 

Float.Parsing" II 

Float.Parsing" II 

Float.Parsing" II 

Float.Parsing" II 

Float.Parsing" II 

Float.Parsing" II 

Float.Parsing" II 

Float.Parsing" II 

Float.Parsing" II 

Float.Parsing" II 


is  t 

~vars"365 

~token~348 

~tok_type~347 

~result~309 

~answer~20 

"result _inc"310 

~putback~233 

~primitive"230 

~level6~189 

~level5~188 

~level4~187 

~level3~186 

~level2~185 

~levell~184 

"get .token" 145 

~get_exp~144 

'find_var~101 


(T)char* 
(R)iswhite 
(R)isdelim 
(R)is_in 


(T) float* 
(T)char* 
(R) unary 
(R)arith 


(G)prog 

(G)vars 

(G)token 

(G)tok.type 

(G) result 

(G) answer 

(R)main 

(R)result.inc 

(R)putback 

(R)primitive 

(R)level6 

(R)level5 

(R)level4 

(R)level3 

(R)level2 

(R)levell 

(R)get.token 

(R)get.exp 

(R)f ind.var 


Figure  5.4.  Identified  objects  in  version  1  of  recursive  descent  expression  parser 
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The  approach  to  validate  the  complexity  metrics  consists  of  proving  that  the  fol- 
lowing hypothesis  hold.  The  hypothesis  states  that  the  primitive  complexity  metrics 
factors  are  sensitive  to  changes  in  the  complexity  caused  by  different  partitionings. 
Different  partitionings  are  the  result  of  (1)  using  different  partitionings  approaches 
under  the  same  connectivity,  or  (2)  totally  different  connectivities.  The  connectivity 
of  a  program  is  the  set  of  relationships  between  components  of  the  program  defined  by 
the  data  bindings  among  components  due  to  global  variables  and  calling  sequences. 
We  use  the  second  situation  to  prove  that  the  hypothesis  holds. 

The  methodology  to  prove  the  hypothesis  is  to  use  a  case  study  program  to  show 
that  the  primitive  metrics  reflect  different  complexity  measures  in  two  versions  of  the 
program  which  are  functionally  equivalent  and  have  different  connectivities  as  a  result 
of  having  different  structures.  An  schematic  description  of  this  validation  methodolo- 
gy is  given  in  Figure  5.5.  From  Figure  5.5,  the  validation  of  the  metrics  factors  consists 
of  proving  the  following  hypothesis:  If  Co\  is  less  complex  than  Co2,  then  Me\  < 
Me2,  where  Co-i  is  the  expected  complexity  of  the  original  version,  Co2  is  the  ex- 
pected complexity  of  version  1,  Me\  is  the  measured  primitive  metrics  factors  values 
of  the  original  version,  and  Me2  is  the  measured  primitive  metrics  factors  values  ol 
version  1. 

For  the  proof,  we  use  the  program  above  with  two  versions:  the  original  version 
and  version  1.  We  made  version  1  to  be  more  complex  in  terms  of  the  connectivity 
of  the  program,  since  a  program  with  greater  connectivity  is  expected  to  be  more 
complex  given  that  all  other  factors  remain  the  same.  Version  1  is  made  more  complex 
by  replacing  most  formal  parameters  with  global  variables.  Thus,  the  number  of 
expected  relations  between  components  of  the  program  will  increase. 

The  expected  complexity  of  each  of  the  two  versions  is  defined  as  the  complexity 
resulting  from  the  connectivies  between  components  of  a  system.  We  expect  that  the 
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connectivity 


partitioning 


original  version 


version  7 


complexity 


metrics  Me-1 


C-2 

object  finder^ 


P-2 


Co-1 


Co-2 


(''computea) 


Me-2 


Figure  5.5.  Validation  of  the  primitive  complexity  metrics  factors 
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groups  in  the  partitioning  obtained  from  version  1  are  more  intra-group  complex  than 
the  groups  in  the  partitioning  obtained  from  the  original  version.  This  is  the  case 
since  each  group  in  version  1  partitionings  is  expected  to  be  highly  cohesive  by  the 
added  global  variables  in  version  1.  We  also  expect  that  the  groups  in  version  1  are 
less  inter-group  complex  because  the  groups  in  version  1  partitioning  are  expected  to 
be  loosely  connected  by  the  global  variables  in  version  1. 

Each  version  of  the  program  exhibits  different  connectivity  between  the  program 
components.  Hence,  the  object  finder  partitionings  for  each  version  of  the  program 
were  different.  In  addition,  the  objective  of  the  study  that  the  program  versions  were 
functionally  equivalent  but  structurally  different  was  met. 

The  validation  results  for  each  complexity  metric  factor  are  summarized  nex- 
t.  First,  the  results  about  the  complexity  metrics  factors  that  measure  inter-group 
complexity. 

f-i  Set  of  direct  interface  variables  The  output  for  the  two  versions  of  the  pro- 
gram indicated  that  the  original  version  presented  higher  inter-group  complexity  than 
version  1.  This  observation  shows  the  effect  of  different  partitionings  on  the  metric- 
s.  In  addition,  as  expected,  there  is  higher  interactions  between  components  in  the 
original  version  than  between  components  in  version  1. 

U  Set  of  indirect  interface  variables  No  relevant  results  were  obtained  for  this 
factor. 

fa  Boundary  set  The  output  for  the  two  versions  of  the  program  indicated  that 
the  original  version  presented  slightly  higher  inter-group  complexity  than  version  1. 
This  observation  is  in  accordance  with  the  expected  variation  in  the  metrics. 
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fi  Variables  in  boundary  set  The  output  for  the  two  versions  of  the  program 
indicated  that  the  two  versions  presented  similar  inter-group  complexity. 

/s  Sum  of  type  sizes  of  direct  interface  variables  The  output  for  the  two  ver- 
sions of  the  program  indicated  that  the  original  version  presented  higher  inter-group 
complexity  than  version  1.  This  is  the  case  since  version  1  partitioning  reduces  the 
number  of  global  variables  transferring  information  between  components. 

ffi  Sum  of  type  sizes  of  indirect  interface  variables  No  relevant  results  were  ob- 
tained for  this  factor. 

fx  Sum  of  type  sizes  of  boundary  set  variables  The  output  for  the  two  ver- 
sions of  the  program  indicated  that  the  original  version  presented  lower  inter-group 
complexity  than  version  1.  Version  1  added  some  global  variables  to  the  program. 
They  influenced  the  sizes  observed  in  the  new  program.  However,  the  increased  val- 
ue of  the  type  size  is  related  to  the  number  of  relations  (data  bindings)  between 
components.  This  number  was  lower  in  version  1  than  in  the  original  version. 

fa  Set  of  types  of  boundary  set  variables  The  output  for  the  two  versions  of 
the  program  indicated  that  the  original  version  presented  higher  inter-group  com- 
plexity than  version  1.  This  is  the  case  since  version  1  partitioning  reduces  the 
number  of  variables  transferring  information  between  components. 

In  conclusion,  the  inter-group  complexity  in  version  1  is  lower  than  the  one  in 
the  original  version.  This  coincides  with  our  expectations  regarding  the  effect  of  the 
changes  and  the  resulting  partitioning  on  the  primitive  complexity  metrics  factors. 
That  is  to  say,  the  partitioning  resulting  from  version  1  agglomerates  most  relations 
between  components  inside  the  groups  which  reduces  the  inter-group  complexity. 
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fc,  Set  of  direct  internal  variables  The  output  for  the  two  versions  of  the  pro- 
gram indicated  that  version  1  presented  higher  inter-group  complexity  than  the  orig- 
inal version.  This  coincides  with  our  expectations  that  the  intra-group  complexity 
will  increase  in  version  1  since  more  global  variables  are  used. 

An  Set  of  indirect  internal  variables  The  output  for  the  two  versions  of  the 
program  indicated  that  the  two  versions  presented  similar  intra-group  complexity. 
One  reason  for  such  situation  is  the  fact  that  the  code  related  to  factor  /10  is  not 
modified  between  versions. 

ft  i  Interior  set  The  output  for  the  two  versions  of  the  program  indicated  that 
the  original  version  presented  lower  intra-group  complexity  than  version  1.  This 
coincide  with  our  expectation  of  the  intra-group  complexity  will  increase  in  version 
1  since  more  global  variables  are  used. 

/i9  Percentage  of  modules  accessing  all  global  variables     No  relevant  results  were 
obtained  for  this  factor. 

/la  Sum  of  type  sizes  of  interior  set  variables  The  output  for  the  two  versions 
of  the  program  indicated  that  the  original  version  presented  lower  intra-group  com- 
plexity than  version  1.  This  demonstrates  that  factor  /13  increases  with  a  higher 
intra-group  complexity  partitioning,  such  as  the  one  in  version  1  of  the  program. 

/id  Set  of  types  of  interior  set  variables  The  output  for  the  two  versions  of  the 
program  indicated  that  the  original  version  presented  lower  intra-group  complexity 
than  version  1.  This  coincides  with  our  expectations  that  the  intra-group  complexity 
will  increase  in  version  1  since  more  relations  occur. 
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In  conclusion,  the  intra-group  complexity  in  version  1  is  higher  than  the  intra- 
group  complexity  in  the  original  version.  This  coincides  with  our  expectation  regard- 
ing the  effect  of  version  1  partitioning  on  the  metrics.  That  is  to  say,  the  partitioning 
resulting  from  version  1  agglomerates  most  relations  between  components  inside  the 
groups  which  increases  the  intra-group  complexity. 

/is  Set  of  grouping  manipulation  pairs  The  output  for  the  two  versions  of  the 
program  indicated  that  the  original  version  presented  higher  intra-group  strength 
than  version  1.  This  coincides  with  our  expectation  that  the  partitioning  of  version  1 
decreases  the  strength  because  parameters  have  been  replaced  with  global  variables 
and,  as  previously  indicated,  the  addition  of  global  variables  reduces  the  strength  of 
individual  components. 

/ifi  Set  of  degrouping  manipulation  pairs  The  output  for  the  two  versions  of 
the  program  indicated  that  the  original  version  presented  higher  intra-group  strength 
than  version  1.  This  coincides  with  our  expectations  that  the  partitioning  resulting 
from  version  1  decreases  the  strength. 

/i  7  Ratio  of  grouping  to  degrouping  manipulation  pairs  The  output  for  the  t- 
wo  versions  of  the  program  indicated  that  the  original  version  presented  higher  intra- 
group  strength  than  version  1.  This  coincides  with  our  expectations  that  the  parti- 
tioning resulting  from  version  1  decreases  the  strength. 

In  conclusion,  the  intra-group  strength  in  the  original  version  of  the  program 
is  higher  than  the  one  in  version  1.  This  coincides  with  our  expectations  regarding 
the  effect  of  version  1  partitioning  on  the  primitive  metrics  factors.  That  is  to  say, 
version  1  reduces  the  strength  of  the  resulting  components  due  to  the  elimination  of 
the  parameters  from  the  routines  interface  which  were  replaced  with  global  variables. 
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The  validation  of  these  primitive  metrics  factors  shows  a  correlation  between  the 
primitive  metrics  factors  and  the  expected  complexity  associated  with  a  partitioning 
of  a  system.  We  conclude  that  the  complexity  of  a  partitioning  is  effectively  measured 
by  these  primitive  complexity  metrics  factors.  Different  complexities  will  result  in 
different  values  of  the  primitive  metrics  factors.  The  future  research  consists  of 
decribing  the  relationships  between  complexity  metrics  factors  in  terms  of  a  formula. 

5.4     The  Test  Cases:  Identified  Objects.  Clusters  and  Groups 

The  following  sections  present  the  partitionings  identified  in  the  three  test  case 
programs  using  two  modularization  techniques.  The  partitionings  consists  of  the 
identified  objects  using  the  top-down  analysis  method  of  the  object  finder  and  the 
clusters  identified  using  Hutchens  and  Basili  hierarchical  clustering  technique  [14]. 
A  generalized  view  of  both  kind  of  partitionings  consists  of  a  set  of  groups,  from 
Definition  17;  a  group  usually  corresponds  to  an  identified  object  in  the  object  finder 
and  to  a  cluster  in  hierarchical  clustering.  In  the  following  sections,  we  explain  how 
the  groups  based  on  identified  objects  and  on  cluster  were  specified  for  each  test  case. 

5.4.1     Test  Case  1:  Name  Cross-reference  Program 

The  first  test  case  consists  of  the  name  cross-reference  program  from  Section  8.1. 
The  statistics  of  this  example  are  given  in  Table  5.4.  This  section  presents  the  iden- 
tified objects  by  the  object  finder  using  the  top-down  analysis  method.  In  addition, 
we  present  the  clusters  identified  in  this  program  using  Basili's  hierarchical  clustering 
technique  [14]. 

The  identified  objects,  obtained  by  the  object  finder  during  the  top-down  anal- 
ysis method,  are  shown  in  Figure  8.1.  The  groups  corresponding  to  the  identified 
objects  in  Figure  8.1  are  derived  after  the  user  performs  some  modifications  on  the 


Table  5.4.  Statistics  of  the  test  case  programs. 
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Program  name 

Lines  of  code 

Global  variables 

Functions 

Types 

Name  cross- 
reference 

282 

1 

10 

10 

Algebraic  expression 
evaluation 

1324 

10 

50 

14 

Table 

management 

1,900 

8 

50 

4 

identified  objects.  The  purpose  of  the  user's  modifications  is  to  eliminate  the  com- 
monality between  objects  by  removing  common  components  between  objects,  as  ex- 
plained in  Section  8.1,  using  xobject,  which  results  in  the  objects  of  Figure  8.3;  these 
disjoint  objects  were  used  to  derive  the  groups.  The  groups  are  shown  in  Figure  5.6. 
Since  these  groups  are  based  on  objects  after  the  modifications,  we  name  a  group  by 
using  the  name  of  the  object  that  correspond  to  the  group. 

The  clusters  of  this  test  case  were  identified  by  a  clustering  tool  called  basili 
[21]  that  implements  Basili's  hierarchical  clustering  technique.  The  identified  clusters 
in  the  cross-reference  program  are  shown  in  Figure  5.7. 

Figure  5.7  showed  the  clusters  defined  on  this  example  using  Basili's  clustering 
techniques.  The  groups  corresponding  to  these  clusters  are  shown  in  Figure  5.8. 
In  this  case,  a  group  consists  of  one  or  more  clusters  of  routines  which  maintain  the 
same  levels  of  coupling  and  strength  [14]  of  the  cluster  identified  with  the  hierarchical 
clustering  technique.  The  groups  are  constructed  as  follows.  Initially,  groups  are 
defined  based  on  the  top  level  of  corresponding  clusters.  In  this  case,  the  naming 
of  groups  is  arbitrary  and  only  serves  to  distinctly  identify  each  group.  Then,  the 
routines  which  did  not  cluster  were  considered  to  be  groups  of  size  one.  An  alternative 
approach,  during  this  last  step,  is  to  group  together  the  unclustered  routines. 
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Group  z#objTLINEPTR#3  is  { 
~~LINEPTR~16  :(T)LINEPTR 
xref_tab~ II~make  linenode~138 

} 

Group  z#objTWTPTR*#4  is  { 
~ " WTPTR* "104  : (T ) WTPTR* 

xref _tab~ II~make_wordnode~140 

xref.tab" II~init_tab~102 

xref _tab~ II "addword" 64 

xref _tab~ II~add_lineno~61 

xref_out~_  II~writewords~181 

} 

Group  z#objTchar*#5  is  { 
~~char*~56 
xref  tab"   II~strsave~164 

} 

Group  z#objTiirt*#6   is  { 

~~int*~69    :(T)int* 

xref" II~main~137  : (R)main 

} 

Group  z#objGlineno#2  is  { 

xref _in~ II~lineno~130 

xref _in~ II~getword~98 

xref _in~ II~getachar~93 

} 


(R)make_linenode 


(R) make_wordnode 
(R)init_tab 
(R) addword 
(R)add_lineno 
(R)writewords 


(T)char* 
(R)strsave 


(G)lineno 

(R)getword 

(R)getachar 


Figure  5.6.  Groups  based  on  objects  identified  in  name  cross-reference  program 


Final  dendrogram: 
—  Cluster  No.l   -- 
(100    (50  add.lineno  addword) 
make_linenode 

(50  make_wordnode  strsave)    ) 
--   Cluster  No. 2   — 
(100  getachar  getword) 


Figure  5.7.  Clusters  found  in  the  name  cross-reference  program  by  basili 
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Group  I  is  { 
xref  _tab~__ 
xref  _tab~__ 
xref  _tab~_. 
xref _tab~__ 
xref _tab~_. 

> 

Group  II  is  { 

xref _in~ 

xref _in~ 

} 

Group   III    is   { 

xref  tab~_ 
} 
Group  IV  is  { 

xref _out~_. 
} 
Group  V  is  { 

xref" II' 

} 


_II~add_lineno~61 
.II~addword~64 
.II~make_wordnode~140 
.II~strsave~164 
II~make_linenode~138 


II~getword~98 
II~getachar~93 


II~init_tab~102 


II~writewords"181 


main" 137 


(R)add_lineno 

(R) addword 

(R) make_wordnode 

(R)strsave 

(R)make_linenode 


(R)getword 
(R)getachar 


(R)init_tab 

(R)writewords 

(R)main 


Figure  5.8.  Groups  based  on  clusters  found  in  name  cross-reference  program 

5.4.2     Test  Case  2:  Algebraic  Expression  Evaluation  Program 

The  second  test  case  consists  of  the  simple  algebraic  expression  evaluation  pro- 
gram from  Section  8.2.  The  statistics  of  this  program  are  listed  in  Table  5.4.  This 
section  presents  the  identified  objects  by  the  object  finder  during  the  top-down  anal- 
ysis method.  In  addition,  we  present  the  clusters  identified  in  this  program  using 
Hutchens  and  Basili  hierarchical  clustering  technique  [14]. 

The  identified  objects,  obtained  by  the  object  finder  during  the  top-down  anal- 
ysis method,  are  shown  in  Figures  8.4  and  8.5.  The  groups  corresponding  to  these 
identified  objects  are  derived  after  the  user  performs  modifications  on  the  identified 
objects.  Similarly  to  Section  5.4.1,  we  name  a  group  based  on  the  name  of  the  object 
that  corresponds  to  the  group.  The  groups  are  shown  in  Figures  5.9  and  5.10. 
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Group  z#objTN0DE*#6  is  { 
~~N0DE*~15 

function" II~Variable_eval~140 

function" II~Plus_eval~104 

function" II~Node_eval~101 

function" II~Multiply_eval~96 

function" II~Minus_eval"94 

function" II~Function_precedence~82 

function" II ~Function_delete_f unction 

function" II~Function_checkandadd~68 

function" II~Function_build_tree~66 

function" II~Function_add_operator~62 

function" II~Function_add_operands~60 

function" II~Echo_treeO~29 

function" II~Divide_eval~22 

function" II "Const ruct_f unction" 16 

function" II"Constant_eval~12 

function" II~Function_parenthesis~80 

function" II"Function_ovldop~73 

function" II~total_paren~292 

function" II "root "267 

function" II~queue~263 

function" II~last~216 

function~___II~first~193 


(T)NODE* 

(R)Variable_eval 

(R)Plus_eval 

(R)Node_eval 

(R) Mult  iply_eval 

(R)Minus_eval 

(R) Funct  ion_precedence 

(R) Funct  ion_delet e_f unct ion 

(R) Funct ion_checkandadd 

(R) Funct ion_build_tree 

(R) Funct ion_add_operator 

(R) Funct ion_add_operands 

(R)Echo_treeO 

(R)Divide_eval 

(R)Construct_f unct ion 

(R) Constant_eval 

(R) Funct ion_parenthesis 

(R) Funct  ion_ovldop 

(G)total_paren 

(G)root 

(G) queue 

(G)last 

(G)first 


Group  z#objTchar*#7  is  { 
~~char*~18 

state" II~State9_transition"128 

state" II~State7_transition~126 

state" II~State6_transition~124 

state" II~State5_transition"122 

state" II~State4_transition"120 

state" II~State3_transition~118 

state" II~State2_transition~116 

state" II~Statel_transition~114 

state" II~StateO_transition~112 

function"        II~Echo_tree~27 


(T) char* 

(R)State9 

(R)State7 

(R)State6 

(R)State5 

(R)State4 

(R)State3 

(R)State2 

(R)  St  at  el 

(R)StateO 

(R)Echo_t 


trans 
trans 
trans 
trans 
trans 
trans 
trans 
trans 
trans 
ree 


ition 
ition 
ition 
ition 
ition 
ition 
ition 
ition 
ition 


Figure  5.9.    Groups  based  on  types-based  objects  identified  in  algebraic  expression 
evaluation  program 
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Group  z#objTfloat#8   is   { 


~~float~5 

(T) float 

function" II~F6~52 

(R)F6 

function" II~F5~46 

(R)F5 

function~___II~F4~41 

(R)F4 

function" II~F3~37 

(R)F3 

function" II~F2~34 

(R)F2 

function"   II~F1~32 
} 
Group  z#objTint#9  is  { 

(R)F1 

~~int~4 

(T) int 

exprtst"   II~main~217 

} 

Group  z#objTvoid#10  is  { 

(R)main 

~~void~2 

(T)void 

function"   II~Destruct_function' 
} 

'21 

(R)Destr 

Figure  5.9  —  continued 


Similarly  to  Section  5.4.1,  the  clusters  of  this  test  case  were  identified  by  a 
clustering  tool  called  basili  [21]  based  on  Basili's  hierarchical  clustering  technique. 
The  identified  clusters  in  the  expression  evaluation  program  are  shown  in  Figure  5.11. 

Figure  5.11  showed  the  clusters  defined  in  this  example  using  Basili's  clustering 
techniques.  The  groups  corresponding  to  these  clusters  are  shown  in  Figure  5.12.  In 
this  test  case,  a  group  consists  of  one  or  more  clusters  and  sub- clusters  of  routines 
which  results  in  the  same  degree  of  coupling  and  strength  [14]  of  the  cluster  identified 
by  the  hierarchical  clustering  technique.  The  groups  are  constructed  as  follows.  First, 
a  group  corresponds  to  the  set  of  routines  which  form  sub-clusters  with  the  lowest 
coupling  and  highest  cohesion  between  the  routines  according  to  Hutchens  and  Basili 
[14]  definition  of  strength  and  coupling.  Next,  the  routines  which  did  not  cluster  were 
grouped  into  a  single  group,  named  group  V. 
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Group  z#objGtable_index#2  is  { 

symbol" II~table~284 

symbol" II~table_index~286 

symbol" II~Symbol_table_get_value~138 

symbol" II ~Symbol_table_get_ index" 136 

symbol" II~Symbol_table_clear~135 

symbol" II~Symbol_table_add_variable~ 

symbol" II~Symbol_table_add_value~130 

symbol" II~Echo_symbol_table~26 

symbol" I I "Construct _symbol_table~20 

> 

Group  z#objGparen_count#3  is  { 

state" II~paren_count~254 

state" II~StateO_inc_paren~lll 

state" II~StateO_get_paren~110 

state~_  _II~StateO_dec  paren~108 

} 

Group  z#objGexpression#4  is  { 

state" II~index~213 

state" II "express ion" 178 

state" II~StateO_get_char~109 

state" II~Construct_stateO~19 

function" II~Function_valid~85 


(G)table 

(G)table_index 

(R) Symbol_table_get_value 

(R) Symbol_table_get_index 

(R) Symbol_table_clear 

(R)Symbol_table_add_variable 

(R) Symbol_table_add_value 

(R) Echo_symbol_table 

(R) Construct_symbol_table 


(G)paren_count 
(R) StateO_inc_paren 
(R)StateO_get_paren 
(R) StateO_dec_paren 


(G) index 

(G) expression 

(R) StateO_get_char 

(R) Construct.stateO 

(R) Funct  ion_val id 


Figure  5.10.  Groups  based  on  globals-based  objects  identified  in  algebraic  expression 
evaluation  program 


Final  dendrogram: 
--  Cluster  No.l   -- 
(9  Function_ovldop 

Symbol_table_add_value) 

--  Cluster  No. 2  -- 
(100    (66  Construct_f unction 

(62  Function_add_operands 
(44  Funct ion_build_tree 

Symbol_table_add_variable) 
Function_checkandadd)    ) 
Echo_queue  Echo_tree0 
Funct ion_add_operat or 
Function_delete_function  main) 


Figure  5.11.  Clusters  found  in  algebraic  expression  evaluation  program 
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Group  I  is  { 

function" 

symbol" II 

} 

Group  II  is  { 

function" 

function" 

exprtst" I 

} 

Group  III  is  { 

function" 

function" 

function" 

} 

Group  IV  is  { 

function" 

symbol" II 

} 

Group  V  is  { 

function" 

function" 

function" 

function" 

function" 

function" 

function" 

function" 

function" 

function" 

state" II" 

state" II" 

state" II" 

state" II" 

state" II" 

state" II" 

state" II" 

state" II" 

state" II" 

function" 

function" 

function" 

function" 

function" 

function" 

function". 


II~Function_ovldop~73 
"Symbol_table_add_value~ 130 


II"Function_add_operator~62 
II ~Function_delete_f unction 
I~main~217 


II "Const ruct_f unction" 16 
II"Function_add_operands~60 
II~Function  checkandadd~68 


II "Funct ion_build_tree~66 
~Symbol_table_add_variable" 


II~Variable_eval~140 

II~Plus_eval"104 

II~Node_eval~101 

II~Multiply_eval~96 

II~Minus_eval~94 

II~Function_precedence~82 

II~Echo_treeO~29 

II~Divide_eval~22 

II"Constant_eval~12 

II  "Funct  ion_parenth.es  is~80 

State9_transition~128 

State7_transition~126 

State6_transition"124 

State5_transition~122 

Stat e4_transit ion" 120 

State3_transition~118 

State2_transition~116 

Statel_transition~114 

State0_transition"112 

II~Echo_tree~27 

II~F6~52 

II~F5~46 

II~F4~41 

II~F3~37 

II~F2~34 

II"F1~32 


(R) Funct  ion_ovldop 

(R) Symbol_table_add_value 


(R) Funct ion_add_operat or 
(R) Funct  ion_delet e_f unct ion 
(R)main 


(R)Construct_f unction 
(R) Funct ion_add_operands 
(R) Funct  ion_checkandadd 


(R) Funct ion_build_tree 
(R)Symbol_table_add_variable 


(R) Variable_eval 

(R)Plus_eval 

(R)Node_eval 

(R)Multiply_eval 

(R)Minus_eval 

(R) Funct ion_precedence 

(R)Echo_treeO 

(R)Divide_eval 

(R) Constant_eval 

(R) Funct  ion_parenthes  i  s 

(R) State9_transition 

(R) State7_transition 

(R)State6_transition 

(R) State5_transition 

(R) State4_transition 

(R)State3_transition 

(R)State2_transition 

(R)Statel_transition 

(R)StateO_transition 

(R)Echo_tree 

(R)F6 

(R)F5 

(R)F4 

(R)F3 

(R)F2 

(R)F1 


Figure  5.12.  Groups  based  on  clusters  found  in  algebraic  expression  evaluation  pro- 
gram 
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Group  V  (cont.)  is  { 

function" II~Destruct_function~21 

symbol" II~Symbol_table_get_value~138 

symbol" II ~Symbol_table_get_ index" 136 

symbol" II~Symbol_table_clear~135 

symbol" II~Echo_symbol_table~26 

symbol" I I "Construct _symbol_table~20 

state" II~StateO_inc_paren~lll 

state" II~StateO_get_paren~110 

state" II~StateO_dec_paren~108 

state" II~StateO_get_char~109 

state" II~Construct_stateO~19 

function" II~Function  valid~85 

} 


(R)Destruct_f unction 

(Ft)  Symbol_table_get_value 

(R) Symbol_table_get_index 

(R) Symbol_table_clear 

(R) Echo_symbol_table 

(R) Construct_symbol_table 

(R)StateO_inc_paren 

(R)StateO_get_paren 

(R) StateO_dec_paren 

(R)StateO_get_char 

(R)Construct_stateO 

(R) Funct  ion_val id 


Figure  5.12  —  continued 


5.4.3     Test  Case  3:  Symbol  Table  Management  for  Acx 

The  third  test  case  consists  of  the  symbol  table  management  module  of  an  ANSI 
"C"  parser  tool.  The  statistics  of  this  program  are  listed  in  Table  5.4.  This  section 
presents  the  identified  objects  by  the  object  finder  during  the  top-down  analysis 
method.  In  addition,  we  present  the  clusters  identified  in  this  program  using  Hutchens 
and  Basili  hierarchical  clustering  technique  [14]. 

The  identified  objects,  obtained  by  the  object  finder  during  the  top-down  anal- 
ysis method,  are  shown  in  Figures  5.13  and  5.14.  The  groups  corresponding  to  these 
identified  objects  are  derived  after  the  user  performs  modifications  on  the  identified 
objects.  Similarly  to  Section  5.4.1,  we  name  a  group  based  on  the  name  of  the  object 
that  correspond  to  the  group.  The  groups  are  shown  in  Figures  5.15  and  5.16. 

Similarly  to  Section  5.4.1,  the  clusters  of  this  test  case  were  identified  by  a 
clustering  tool  called  basili  [21]  based  on  the  hierarchical  clustering  technique.  The 
identified  clusters  in  the  symbol  table  management  program  are  shown  in  Figure  5.17. 
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Qbj 


} 

Qbj 


} 

Qbj 


} 
Obj 


ect  z#objTFILE*#3  is  { 
~~FILE*~124 

table" II~output_usage"283 

table" II~output_table~271 

ect  z#objTSymbol*#4  is  { 
~~Symbol*~64 

table" II~where_defined~448 

table" II~symbol_dup~361 

table" II ~new_ symbol "251 

table" II~move_use~244 

table" II~move_reference~242 

table" II~move_def "237 

table" II~merge_type~234 

table" II~make_undef ined_syrabol~211 

table" II~make_symbol~209 

table" II~ismemberof "194 

table" II"isdefined~192 

table" II~is_type_alias~190 

table" II~is_macro_name~188 

table" II~insert_use~184 

table" II~insert_symbol~182 

table" II~insert_ref erence~180 

table" II~insert_def "178 

table" II ~get_top_ symbol" 149 

table" II~get_entry_string~145 

table" II~get_default~144 

table" II~find_symbol~132 

table" II~add_type~63 

ect  z#objTToken*#5  is  { 
~ "Token* "95 

table" II~token_dup~416 

table" II~token_cmp~413 

table" II~new_token~252 

table~___II~isthere~197 

ect  z#objTType*#6  is  { 
~~Type*~254 

table" II~type_dup~429 

table" II~type_cat~426 

table" II~new_type~253 


(T)FILE* 

(R) output_usage 

(R) output_table 


(T) Symbol* 

(R)where_def ined 

(R) symbol_dup 

(R)new_symbol 

(R)move_use 

(R)move_ref erence 

(R)move_def 

(R)merge_type 

(R) make_undef ined_ symbol 

(R)make_ symbol 

(R)ismemberof 

(R)isdef ined 

(R) is_type_alias 

(R) is_macro_name 

(R) insert _use 

(R) insert_symbol 

(R) insert_ref erence 

(R)insert_def 

(R)get_top_symbol 

(R) get_entry_string 

(R)get_default 

(R)f ind_symbol 

(R)add_type 


(T) Token* 

(R)token_dup 

(R)token_cmp 

(R)new_token 

(R)isthere 


(T)Type* 
(R)type_dup 
(R)type_cat 
(R)new_type 


Figure  5.13.   Types-based  candidate  objects  identified  in  symbol  table  management 
program 
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Object  z#objTchar*#7  is  { 

~~char*~58 

table" II~get_type_name~151 

table" II "get _table_ index" 147 

} 

Object  z#objTentry_tag*#8  is  { 

~~entry_tag*~250 

table"  II~new_entry~249 

} 

Object  z#objT+undetermined-type#9  is  { 

~~symbo_tag*~208 

~~Type*~254 

~~Token*~95 

~~Symbol*~64 

~~FILE*~124 

table" II" output _use~ 285 

table" II~output_type~276 

table" II~output_type_member~280 

table" II~output_token~273 

table" II "output _reference~268 

table" II "output _parameter~262 

table" II~output_parameter_usage~265 

table" II ~ output _def "259 

table" II~output_declaration_type~256 

table"   II~make_name~205 
> 


(T) char* 

(R)get_type_name 
(R) get_table_index 


(T) entry _tag* 
(R)new_ entry 


(T)symbo_tag* 

(T)Type* 

(T) Token* 

(T) Symbol* 

(T)FILE* 

(R) output _use 

(R) output _type 

(R) output_type_member 

(R) output_token 

(R) output_ref erence 

(R) output_parameter 

(R) output_parameter_usage 

(R)output_def 

(R)output_declaration_type 

(R)make_name 


Figure  5.13  —  continued 
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Object  z#obj 

table" 

table" 

table" 

table" 

table" 

table" 

table" 

table" 

table" 

table" 

table~___ 

table" 

table" 

table" 
table" 
table" 
table" 
table" 
table" 
table" 
table" 
table" 
table" 
table" 
table" 
table" 
table" 
table" 

} 


Gcurrent_count#2  is  { 

II~use_list~446 

II~symbol_table~363 

II~goto_ref erence  list~157 

II~def~98 

II~debug_token~94 

II "current _side~92 

II "current _scope~91 

II "current _func_name"90 

II"current_count~89 

II~where_defined~448 

II~table_reset~382 

II~table_init~381 

II~table_final~380 

II~remove_scope_flag~310 

1 1 ~pr _debug_t  oken" 2  94 

II~output_usage~283 

II~output_table"271 

II~make_symbol~209 

II~make_name~205 

II~isdefined~192 

II"is_type_alias~190 

II~is_macro_name~188 

II~insert_use~184 

II ~ insert _ symbol "182 

II" insert_ref erence" 180 

II~insert_def~178 

II"flag_scope_flag~136 

II~find_symbol~132 


(G)use_list 

(G) symbol_table 

(G)goto_ref erence_list 

(G)def 

(G) debug_token 

(G) current_side 

(G) current_scope 

(G) current_f unc_name 

(G) current_count 

(R) where_def ined 

(R)table_reset 

(R)table_init 

(R)table_final 

(R) remove_scope_f lag 

(R)pr_debug_token 

(R) output_usage 

(R) output_table 

(R)make_ symbol 

(R)make_name 

(R)isdef ined 

(R) is_type_alias 

(R) is_macro_name 

(R)insert_use 

(R) insert_symbol 

(R) insert_ref erence 

(R)insert_def 

(R) f lag_scope_f lag 

(R)f ind_symbol 


Figure  5.14.  Globals-based  candidate  objects  identified  in  symbol  table  management 
program 
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Group  z#objTFILE*#3  is  { 
~~FILE*~124 

table" II~output_usage~283 

table" II"output_table~271 

Group  z#objTSymbol*#4  is  { 

~~Symbol*~64 

table" II~symbol_dup~361 

table" II~new_symbol~251 

table" II~move_use~244 

table" II~move_ref erence~242 

table" II~move_def~237 

table" II~merge_type~234 

table" II~make_undef ined_symbol~211 

table" II~make_symbol~209 

table" II~ismemberof "194 

table" II~isdefined~192 

table" II~is_type_alias~190 

table" II~is_macro_name~188 

table" II~insert_use~184 

table" II ~ insert _ symbol "182 

table" II~insert_ref erence~180 

table" II~insert_def~178 

table" II~get_top_symbol~149 

table" II "get _entry_ string" 145 

table~___II~get_default~144 

table" II~find_symbol~132 

table"  __II~add_type~63 
} 
Group  z#objTToken*#5  is  { 

~~Token*~95 

table" II~token_dup~416 

table" II~token_cmp~413 

table" II~new_token~252 

table" II~isthere~197 


} 


{ 


Group  z#objTType*#6  is 
~~Type*~254 

table" II~type_dup~429 

table" II~type_cat~426 

table"   II~new  type~253 

} 


(T)FILE* 

(R) output _usage 

(R) output_table 


(T) Symbol* 

(R) symbol_dup 

(R)new_ symbol 

(R)move_use 

(R)move_ref erence 

(R)move_def 

(R)merge_type 

(R)make_undefined_ symbol 

(R)make_symbol 

(R)ismemberof 

(R)isdef ined 

(R) is_type_alias 

(R) is_macro_name 

(R)insert_use 

(R) insert_symbol 

(R) insert_ref erence 

(R)insert_def 

(R) get_top_symbol 

(R)get_entry_string 

(R)get_default 

(R)f ind_symbol 

(R) add_type 


(T) Token* 

(R)token_dup 

(R)token_cmp 

(R)new_token 

(R)isthere 


(T)Type* 
(R)type_dup 
(R)type_cat 
(R)new_type 


Figure  5.15.    Groups  based  on  types-based  objects  identified  in  symbol  table  man- 
agement program 
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Group  z#objTchar*#7  is  { 
char*~58 

table" II~get_type_name~151 

table" II~get_table_ index" 147 

} 

Group  z#objTentry_tag*#8  is  { 

~~entry_tag*~250 

table"  II"new_entry~249 

> 

Group  z#objT+undetermined-type#9  is  { 

~~symbo_tag*~208 

~~Type*~254 

~~Token*~95 

~~Symbol*~64 

~~FILE*~124 

table" II~output_use~285 

table" II~output_type"276 

table" II~output_type_member~280 

table" II~output_token~273 

table" II~output_ref erence~268 

table" II~output_parameter~262 

table" II~output_parameter_usage~265 

table" II~output_def"259 

table" II~output_declaration_type~256 


(T)char* 

(R) get_type_name 

(R)get_table_ index 


(T)entry_tag* 
(R)new_ entry 


(T)symbo_tag* 

(T)Type* 

(T) Token* 

(T) Symbol* 

(T)FILE* 

(R)output_use 

(R) output_type 

(R) output_type_member 

(R) output_token 

(R) output_ref erence 

(R) output_parameter 

(R) output_parameter_usage 

(R)output_def 

(R) output_declaration_type 


Figure  5.15  —  continued 
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Group  z#objGcurrent_count#2  is  { 

table" II~use_list~446 

table" II~symbol_table~363 

table" II~goto_ref erence_list~157 

table~___II"def~98 

table" II~debug_token~94 

table" II "current. side" 92 

table" II "current _scope"91 

table" II "current _func_name~ 90 

table" II "current _count "89 

table" II~where_defined~448 

table" II~table_reset~382 

table" II~table_init"381 

table" II~table_final~380 

table" II~remove_scope_flag~310 

table" II~pr_debug_token~294 

table" II~make_name~205 

table" II~flag_scope_flag~136 


(G)use_list 

(G) symbol_table 

(G)goto_ref erence_list 

(G)def 

(G ) debug_t  oken 

(G)current_side 

(G) current_scope 

(G) current _func_name 

(G) current_count 

(R)where_defined 

(R)table_reset 

(R)table_init 

(R)table_f inal 

(R) remove_scope_f lag 

(R) pr_debug_token 

(R)make_name 

(R)flag_scope_f lag 


Figure  5.16.  Groups  based  on  globals-based  objects  identified  in  symbol  table  man- 
agement program 

Figure  5.17  showed  the  clusters  defined  on  this  example  using  Basili's  clustering 
techniques.  The  groups  corresponding  to  these  clusters  are  shown  in  Figure  5.18.  In 
this  case,  a  group  consists  of  one  or  more  clusters  of  routines  that  maintain  the 
same  levels  of  coupling  and  strength  [14]  of  the  cluster  identified  with  hierarchical 
clustering  technique.  The  groups  are  similarly  constructed  to  the  groups  for  the 
clusters  identified  in  the  algebraic  expression  evaluation  program  of  Section  5.4.2. 
Initially,  groups  are  defined  based  on  the  top  level  of  the  corresponding  clusters. 
Then,  the  routines  which  did  not  cluster  were  grouped  into  a  single  group,  named 
group  V. 

5.5     Comparison  of  Complexity 

This  section  presents  the  results  of  the  comparison  of  the  three  test  cases  in 
study  I  and  a  summary  of  the  results  for  all  three  test  cases.  The  test  cases  are  the 
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Final  dendrogram: 
~  Cluster  No.l  — 
(3  merge_type 

type_cat 

out  put  _p  aramet  er 

output_table 

output .reference 

output _type_member) 

--  Cluster  No. 2  — 
(14  get _top_ symbol 
symbol_dup 
token_dup 
type_dup 
(8  insert_def 

move_def ) 
table_f inal 
(8  insert_use 

move_use) 
(6  isthere 

move_ref erence) 
output_def 

output _parameter_usage 
output _usage 
output _use) 


~  Cluster  No. 3  -- 
(100  (50  find.symbol 

insert _symbol 

get_entry_string 

get_table_index) 

(38  insert .reference 
ismemberof ) 

get_type_name 

isdef ined 

make_undef ined_symbol 

where_def  ined) 


Figure  5.17.  Clusters  found  in  symbol  table  management  program 
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Group  I  is  { 

table" 

___II 

table" 

___II 

table" 

___II 

table" 

___II 

table" 

___II 

table" 

___II 

} 

Group  II 

is  { 

table" 

___H 

table" 

___II 

table" 

___II 

table" 

___II 

table" 

___II 

table" 

___II 

table" 

___ II 

table" 

___II 

table" 

—II 

table" 

___II 

table" 

___II 

table" 

___II 

table" 

—II 

table" 

—  II 

table" 

—II 

} 

Group  III 

is  { 

table" 

—II 

table" 

—II 

table" 

__.II 

table" 

—II 

table" 

—II 

table" 

—II 

table" 

—II 

table" 

—II 

table" 

—II 

table" 

—II 

'merge_type~234 

'type_cat~426 

'output_parameter~262 

'output_table~271 

'output_ref erence~268 

'output_type_member~280 


"get_top_ symbol" 149 

'symbol_dup~361 

'token_dup~416 

'type_dup~429 

'  insert _def~ 178 

"move_def "237 

"table_final~380 

'insert_use~184 

'move_use~244 

'isthere~197 

'move_ref erence~242 

"output_use~285 

"output_parameter_usage~265 

'output_usage~283 

'output_def "259 


'find_ symbol "132 

'  insert _ symbol "182 

'get_entry_string~145 

'get_table_index~147 

'  insert .reference" 180 

'ismemberof "194 

'get_type_name"151 

'isdef ined~192 

'make_undef ined_symbol~211 

'where  defined~448 


(R)merge_type 

(R)type_cat 

(R) output_parameter 

(R) output_table 

(R) output_ref erence 

(R) output _type_member 


(R)get_top_ symbol 

(R) symbol_dup 

(R)token_dup 

(R)type_dup 

(R) insert_def 

(R)move_def 

(R)table_final 

(R)insert_use 

(R)move_use 

(R)isthere 

(R) move_ref erence 

(R)output_use 

(R) output_parameter_usage 

(R) output_usage 

(R)output_def 


(R)f ind_symbol 
(R) insert_symbol 
(R)get_entry_string 
(R) get _t  abl e_  index 
(R) insert_ref erence 
(R) ismemberof 
(R) get_type_name 
(R) isdef ined 

(R)make_undef ined_symbol 
(R) where_def ined 


Figure  5.18.  Groups  based  on  clusters  found  in  symbol  table  management  program 
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Group  IV 

is  { 

table" 

___II 

table" 

___II 

table" 

___II 

table" 

___II 

table" 

___II 

table" 

___II 

table" 

___II 

table" 

...II 

table" 

___II 

table" 

___II 

table" 

...II 

table" 

...II 

table" 

___II 

table" 

...II 

table" 

...II 

table" 

___II 

table" 

...II 

table" 

...II 

table" 

._II 

"new_symbol~251 

"make_symbol~209 

'is_type_alias~190 

'is_macro_name~188 

"get_default~144 

'add_type~63 

"token_cmp~413 

'new_token~252 

'new_type~253 

"new_entry~249 

'  output _type~276 

'  output _t  oken" 273 

'output_declaration_type~256 

'table_reset~382 

'table_init~381 

'remove_scope_flag~310 

'pr_debug_token~294 

'make_name~205 

'flag_scope_flag~136 


(R)new_ symbol 

(R)make_symbol 

(R) is_type_alias 

(R) is_macro_name 

(R)get_def ault 

(R) add_type 

(R)token_cmp 

(R)new_token 

(R)new_type 

(R)new_ entry 

(R) output_type 

(R) output_token 

(R) output.declarat ion.type 

(R) table_reset 

(R)table_init 

(R)remove_scope_flag 

(R) pr _debug_t  oken 

(R)make_name 

(R)f lag_scope_f lag 


Figure  5.18  —  continued 
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name  cross-reference  program  of  Section  5.4.1,  the  algebraic  expression  evaluation  of 
Section  5.4.2,  and  the  symbol  table  management  program  of  Section  5.4.3. 

The  comparison  methodology  consisted  of  comparing  the  complexity  of  two 
partitionings  of  each  test  case,  as  expressed  in  terms  of  the  primitive  complexity 
metrics  factors.  Then,  the  summary  of  all  three  test  cases  is  presented  factor  by 
factor  at  the  end  of  this  section. 

First,  the  primitive  factors  were  computed  for  each  partitioning  of  a  test  case 
program.  A  partitioning  is  due  to  a  particular  modularization  approach;  for  this 
study,  we  considered  two  approaches  to  obtain  the  partitioning:  our  object  finder 
and  the  hierarchical  clustering  technique  by  Hutchens  and  Basili  [14].  Hierarchical 
clustering  was  chosen  for  the  study  since  this  modularization  technique  is  also  based 
on  data  bindings  found  in  a  program;  their  approach  is  one  of  several  modularization 
(clustering)  techniques  which  were  explained  in  Section  2.1.1.1.  Each  partitioning 
consists  of  a  set  of  groups  derived  in  Section  5.4.  Second,  the  comparison  of  the 
complexity  of  the  two  partitionings  of  a  test  case  consists  of  comparing  one  complexity 
metrics  factor  at  a  time. 

The  results  are  presented  for  each  test  case  individually.  We  computed  the 
primitive  complexity  factors  for  the  groups  abstracted  from  the  identified  objects 
and  also  for  the  groups  abstracted  from  the  clusters  defined  using  Basili 's  clustering 
technique.  Then,  we  illustrate  a  factor  by  factor  interpretation  of  the  observed  results 
for  each  test  case.  At  the  end  of  this  section,  we  present  an  overall  interpretation 
of  the  study  which  consists  of  summarizing  the  results  of  the  three  test  cases.  This 
interpretation  of  the  summary  results  demonstrates  that  the  object  finder  approach 
results  in  a  modularization  of  a  program  which  exhibits  significantly  lower  inter-group 
complexity  and  moderately  lower  intra-group  complexity  compared  to  hierarchical 
clustering  techniques. 
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5.5.1     Test  Case  1:  Name  Cross-reference  Program 


This  section  presents  the  comparison  of  the  metrics  factors  computed  for  both 
versions  of  the  name  cross-reference  program.  Version  1  is  the  partitioning  (groups) 
derived  using  the  objects  finder  and  version  2  is  the  partitioning  (groups)  derived 
using  hierarchical  clustering  techniques. 

First,  the  comparison  of  inter-group  complexity  metrics. 

•  Factors  /i  and  /2,  the  sets  of  direct  interface  variables  and  indirect  interface 
variables,  respectively,  are  not  used  in  this  comparison  since  there  are  few  (only 
one)  global  variables  to  consider  in  the  program. 

•  Factor  /3,  the  boundary  set,  shows  the  similar  inter-group  complexity  observed 
among  the  groups  derived  from  both  clustering  techniques  since  the  average 
number  of  data  bindings  in  version  1  was  3  whereas  the  average  number  of  data 
bindings  in  version  2  was  3.5.  On  the  other  hand,  the  inter-group  complexity 
of  the  interface  can  be  enhanced  by  the  object  finder  as  shown  by  the  reduced 
number  of  global  variables  in  the  boundary  set  shown  by  factor  /4  with  a  total 
of  10  variables  in  version  1  and,  12  variables  in  version  2. 

•  Factors  f5  and  /6,  the  sum  of  type  sizes  of  direct  and  indirect  interface  variables, 
were  not  used  in  this  comparison  since  there  are  few  (only  one)  global  variables 
to  consider  in  this  program. 

•  Factors  f?  and  /g,  the  sum  of  the  type  sizes  of  boundary  set  variables  and 
the  set  of  types  of  boundary  set  variables,  respectively,  showed  that  on  the 
average  groups  derived  from  Basili's  clustering  techniques  tend  to  exhibit  a 
more  complex  interface  since  their  routines  "manipulate"  a  larger  set  of  types. 
From  factor  /7,  we  determined  the  total  type  size  of  program  version  1  and 
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2  to  be  30  and  98,  respectively.  Also,  from  factor  /$,  the  number  of  types 
manipulated  by  the  interface  of  groups  in  version  2  was  larger  than  the  number 
of  types  manipulated  by  the  interface  of  groups  in  version  1.  Thus,  the  inter- 
group  complexity  was  larger  between  the  groups  of  version  2. 

Second  the  intra-group  complexity  factors  are  analyzed. 

•  Factors  /g  and  /10,  the  set  of  direct  internal  variables  and  indirect  internal 
variables,  respectively,  were  not  used  in  this  comparison  since  there  were  few 
(only  one)  global  variables  to  consider  in  the  program. 

•  Factor  /n,  the  interior  set,  showed  similar  intra-group  complexity  between  the 
groups  in  both  versions  of  the  program  as  observed  from  the  number  of  bindings 
in  versions  1  and  2.  Also,  factor  /i2,  the  percentage  of  modules  accessing  all 
global  variables,  illustrated  the  similarity  of  intra-group  complexity  between 
the  two  program  versions. 

•  Factors  /13  and  /i4,  the  sum  of  type  sizes  of  interior  set  variables  and  the  set 
of  types  of  interior  set  variables,  respectively,  reflected  the  larger  intra-group 
complexity  observed  in  the  groups  derived  from  Basili's  clustering  techniques; 
specifically,  from  factor  /13,  we  determined  that  the  total  type  size  of  program 
version  1  and  2  were  26  and  50,  respectively.  Also,  from  factor  /i4,  the  number 
of  types  manipulated  by  routines  in  groups  of  version  1  was  equal  to  the  number 
of  types  manipulated  by  routines  in  groups  of  version  2.  Even  though  a  similar 
set  of  types  was  manipulated  by  groups  in  each  version,  the  group's  intra-group 
complexity  was  different  due  to  the  frequency  of  use  of  those  types. 

Finally,  the  intra-group  strength  was  not  considered  in  the  comparison  since 
Basili's  clustering  technique  does  not  use  type  information  during  the  clustering. 
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This  test  case  shows  some  of  the  advantages  of  using  the  object  finder  to  modu- 
larize a  system.  The  resulting  partitioning  exhibited  significantly  reduced  inter-group 
complexity  and  reduced  intra-group  complexity  when  compared  to  the  partitioning 
resulting  from  hierarchical  clustering. 

5.5.2     Test  Case  2:  Algebraic  Expression  Evaluation  Program 

This  section  presents  the  comparison  of  the  metrics  factors  computed  for  both 
versions  of  the  algebraic  expression  evaluation  program.  Version  1  is  the  partitioning 
(groups)  derived  using  the  objects  finder  and  version  2  is  the  partitioning  (groups) 
derived  using  hierarchical  clustering  techniques. 

First,  the  comparison  of  inter-group  complexity  metrics. 

•  Factors  /i  and  /2,  the  sets  of  direct  interface  variables  and  indirect  interface 
variables,  respectively,  show  the  reduced  inter-group  complexity  of  the  identified 
objects  in  terms  of  the  number  of  interface  variables  since  version  1  had  no 
interface  variable  and  version  2  had  an  average  of  three  interface  variables 
between  groups. 

•  Factor  /3,  the  boundary  set,  illustrates  the  reduced  inter-group  complexity 
between  the  groups  of  version  1  compared  to  the  groups  of  version  2  in  terms  of 
the  number  of  data  bindings  between  groups  since  the  average  number  of  data 
bindings  between  the  groups  in  version  1  was  9.8  data  bindings  whereas  between 
groups  in  version  2  was  23.4  data  bindings.  That  is  also  the  case  for  factor  /4, 
the  set  of  variables  in  the  boundary  set.  The  average  number  of  variables  in  the 
boundary  set  for  version  1  was  7.8  variables  whereas  the  average  for  version  2 
was  15. 
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•  Factors  /g  and  fe,  the  sum  of  type  sizes  of  direct  and  indirect  interface  variables, 
showed  the  reduced  the  inter-group  complexity  of  the  identified  objects  in  terms 
of  the  type  size  of  the  interface  variables  since  version  1  type  size  was  zero  and 
version  2  had  a  maximum  type  size  of  106  between  its  groups.  This  showed  the 
reduced  complexity  due  to  the  interface  variables. 

•  Factor  fj%  the  sum  of  type  sizes  of  boundary  set  variables,  showed  the  reduced 
inter-group  complexity  of  the  identified  objects  in  terms  of  the  total  type  size 
of  the  variables  defining  data  bindings  between  groups  since  the  total  type  size 
of  version  1  was  63  whereas  the  total  type  size  of  version  2  was  277. 

•  Factor  f8,  the  set  of  types  of  boundary  set  variables,  showed  the  reduced  inter- 
group  complexity  of  the  identified  objects  in  terms  of  the  number  of  different 
types  manipulated  by  the  interface  of  groups.  On  the  average,  the  number  of 
types  manipulated  by  the  interface  of  groups  in  version  1  was  slightly  less  than 
the  number  of  types  manipulated  by  the  interface  of  groups  in  version  2. 

Second  the  intra-group  complexity  factors  are  analyzed. 

•  Factors  fg  and  fio,  the  set  of  direct  internal  variables  and  indirect  internal 
variables,  respectively,  showed  the  increased  intra-group  complexity  in  the  i- 
dentified  objects  in  terms  of  the  number  of  internal  variables  since  version  1 
had  more  internal  variables  inside  the  groups  than  version  2.  However,  this  is 
a  signal  of  increased  strength  of  the  groups  in  version  1 . 


•  Factor  /11?  the  interior  set,  showed  the  slightly  increased  intra-group  complex- 
ity of  the  identified  objects  in  terms  of  the  number  of  data  bindings  inside  the 
groups  since  the  maximum  number  of  data  bindings  inside  the  groups  in  ver- 
sion 1  was  55  whereas  the  maximum  number  of  data  bindings  inside  groups 
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in  version  2  was  30.    Thus,  the  intra-group  complexity  was  slightly  larger  in 
version  1  than  in  version  2. 

•  Factor  fu,  the  percentage  of  modules  accessing  all  global  variables,  was  not 
used  in  this  comparison  since  it  can  only  be  computed  for  those  groups  derived 
from  globals-based  candidate  objects. 

•  Factor  /13,  the  sum  of  type  sizes  of  interior  set  variables,  showed  slightly  in- 
creased intra-group  complexity  of  the  identified  objects  in  terms  of  the  total 
type  size  of  program  version  1  and  2  since  the  total  type  size  of  version  1  was 
326  whereas  the  total  type  size  of  version  2  was  226. 

•  Factor  fu,  the  set  of  types  of  interior  set  variables,  showed  similar  intra-group 
complexity  inside  the  groups  in  both  versions  of  the  program  as  observed  from 
the  number  of  different  types  manipulated  by  routines  in  groups  of  version  1  and 
the  number  of  different  types  manipulated  by  routines  in  groups  of  version  2. 

Finally,  the  intra-group  strength  factors  were  not  considered  in  the  comparison 
since  Basili's  clustering  technique  does  not  use  type  information  during  the  clustering. 

5.5.3     Test  Case  3:  Symbol  Table  Management  Program  for  Acx 

This  section  presents  the  comparison  of  the  metrics  factors  computed  for  both 
versions  of  the  symbol  table  management  program.  Version  1  is  the  partitioning 
(groups)  derived  using  the  objects  finder  and  version  2  is  the  partitioning  (groups) 
derived  using  hierarchical  clustering  techniques. 

First,  the  comparison  of  inter-group  complexity  metrics. 
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•  Factors  /i  and  /jj,  the  sets  of  direct  interface  variables  and  indirect  interface 
variables,  respectively,  show  the  increased  inter-group  complexity  of  the  identi- 
fied objects  since  the  number  of  interface  variables  in  version  1,  equal  to  3,  was 
larger  than  the  number  of  interface  variables  in  version  2,  equal  to  1. 

•  Factor  /3,  the  boundary  set,  shows  the  significantly  reduced  inter-group  com- 
plexity between  the  groups  in  version  1  compared  to  the  inter-group  complexity 
of  groups  in  version  2  in  terms  of  the  number  of  data  bindings  between  groups 
since  the  average  number  of  data  bindings  between  the  groups  in  version  1  is 
26.6  data  bindings  whereas  between  the  groups  in  version  2  is  41  data  bind- 
ings. However,  the  maximum  number  of  data  bindings  was  much  larger  (80)  in 
version  1  than  in  version  2  (51).  The  inter-group  complexity  was  also  reduced, 
as  illustrated  by  factor  /4,  the  set  of  variables  in  the  boundary  set.  The  av- 
erage number  of  variables  in  the  boundary  set  for  version  1  was  13.2  whereas 
version  2's  average  was  26.7. 

•  Factors  /5  and  /6,  the  sum  of  type  sizes  of  direct  and  indirect  interface  variables, 
showed  the  slightly  increased  inter-group  complexity  of  the  identified  objects  in 
terms  of  the  type  size  of  interface  variables  since  the  average  sum  of  type  sizes 
of  interface  variables  between  groups  in  version  1  (56.3)  was  slightly  higher  than 
the  average  sum  of  type  sizes  of  interface  variables  between  groups  in  version  2 
(55). 

•  Factor  f7,  the  sum  of  type  sizes  of  boundary  set  variables,  showed  the  reduced 
inter-group  complexity  of  the  identified  objects  in  terms  of  the  total  type  size 
between  the  groups  since  the  total  type  size  of  version  1  was  267  and  the  total 
type  size  of  version  2  was  323. 
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•  Factor  /8,  the  set  of  types  of  boundary  set  variables,  showed  the  reduced  inter- 
group  complexity  of  the  identified  objects  in  terms  of  the  number  of  different 
types  manipulated  by  the  interface  of  groups.  On  the  average,  the  number  of 
types  manipulated  by  the  interface  of  groups  in  version  1  (3.25)  was  less  than 
the  number  of  types  manipulated  by  the  interface  of  groups  in  version  2  (6.2). 

Second  the  intra-group  complexity  factors  are  analyzed. 

•  Factors  f9  and  /i0,  the  set  of  direct  internal  variables  and  indirect  internal  vari- 
ables, respectively,  showed  similar  intra-group  complexity  inside  the  groups  in 
both  versions  of  the  program  as  observed  from  the  number  of  internal  variables 
in  version  1  and  2. 

•  Factor  /u,  the  interior  set,  showed  reduced  intra-group  complexity  of  the  iden- 
tified objects  in  terms  of  the  number  of  data  bindings  inside  groups  since  the 
average  number  of  data  bindings  in  the  groups  of  version  1  (10.7)  was  more 
than  half  the  average  number  of  data  bindings  in  the  groups  of  version  2  (26.5). 

•  Factor  /12,  the  percentage  of  modules  accessing  all  global  variables,  was  not 
used  in  this  comparison  since  it  can  only  be  computed  for  those  groups  derived 
from  globals-based  candidate  objects. 

•  Factor  /13,  the  sum  of  type  sizes  of  interior  set  variables,  showed  similar  intra- 
group  complexity  inside  the  groups  since  the  total  type  size  of  program  version  1 
and  2  was  very  close;  specifically,  the  total  type  size  of  version  1  was  157  and 
the  total  type  size  of  version  2  was  167. 

•  Factor  /i4,  the  set  of  types  of  interior  set  variables,  showed  similar  intra-group 
complexity  inside  the  groups  in  both  versions  of  the  program  as  observed  from 
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the  number  of  different  types  manipulated  by  routines  in  groups  of  version  1  and 
the  number  of  different  types  manipulated  by  routines  in  groups  of  version  2. 

Finally,  the  intra-group  strength  factors  were  not  considered  in  the  comparison 
since  Basili's  clustering  technique  does  not  use  type  information  during  the  clustering. 

5.5.4     Summary  and  Conclusions  of  the  Comparison 

This  section  presents  the  summary  of  the  three  test  cases  presented  in  the  pre- 
vious sections  for  each  of  the  primitive  factors  of  complexity  which  affects  inter-group 
and  intra-group  complexity.  We  assume  that  version  1  consists  of  the  partitioning 
(groups)  resulting  from  the  object  finder,  i.e.,  the  identified  objects,  and  version  2 
corresponds  to  the  partitioning  (groups)  resulting  from  hierarchical  clustering,  i.e.: 
the  clusters. 

The  factors  for  inter-group  complexity  are: 

•  Factors  fi  and  /2,  the  sets  of  direct  interface  variables  and  indirect  interface 
variables,  respectively,  included  a  test  case  in  which  the  inter-group  complexity 
was  higher  in  version  1  than  in  version  2,  and  another  test  case  in  which  the 
inter-group  complexity  was  higher  in  version  2  than  in  version  1;  this  factor  was 
ignored  in  the  third  test  case.  Since  the  differences  in  the  values  of  these  factors 
were  not  significant,  we  concluded  that  both  versions  had  similar  inter-group 
complexity  based  on  factors  f\  and  /2  in  all  three  test  cases. 

•  Factor  /3,  the  boundary  set,  showed  significantly  reduced  inter-group  com- 
plexity between  the  groups  in  one  test  cases  and,  at  least,  somewhat  reduced 
inter-group  complexity  in  another  test  cases.  This  factor  demonstrated  that 
the  object  finder-based  groups  presented  lower  inter-group  complexity  than  hi- 
erarchical clustering-based  groups. 
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•  Factor  f4,  the  set  of  variables  in  the  boundary  set,  showed  that  the  object  finder 
significantly  reduced  the  inter-group  complexity  between  the  groups  in  all  three 
test  cases.  In  two  test  cases,  the  average  number  of  variables  in  the  boundary 
set  was  reduced  by  half. 

•  Factors  /§  and  fe,  the  sum  of  type  sizes  of  direct  and  indirect  interface  variables, 
showed  significantly  reduced  inter-group  complexity  in  the  identified  objects  in 
one  of  the  test  cases.  However,  in  another  test  case,  the  inter-group  complexity 
slightly  increased  in  the  groups  derived  from  identified  objects.  This  factor 
showed  the  reduced  inter-group  complexity  of  the  identified  objects. 

•  Factor  /V,  the  sum  of  type  sizes  of  boundary  set  variables,  showed  significantly 
reduced  inter-group  complexity  of  the  identified  objects  in  terms  of  the  total 
type  size  of  the  variables  in  the  boundary  set  between  groups  in  all  three  test 
cases. 

•  Factor  /8,  the  set  of  types  of  boundary  set  variables,  showed  significantly  re- 
duced inter-group  complexity  of  the  identified  objects  in  terms  of  the  number 
of  different  types  manipulated  by  the  interface  of  groups  in  all  three  test  cases. 

The  factors  for  intra-group  complexity  are: 

•  Factors  /9  and  /10,  the  set  of  direct  internal  variables  and  indirect  internal 
variables,  respectively,  included  a  test  case  in  which  the  intra-group  complexity 
was  higher  in  version  1  than  in  version  2,  and  another  test  case  in  which  the 
intra-group  complexity  was  similar  in  both  versions;  these  factors  were  ignored 
in  the  third  test  case.  The  three  test  cases  did  not  illustrate  a  conclusive 
behavior  of  the  intra-group  complexity  expressed  by  these  factors. 
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•  Factor  fn,  the  interior  set,  included  a  test  case  in  which  the  intra-group  com- 
plexity was  higher  in  version  1  than  in  version  2  and  another  test  case  in  which 
the  intra-group  complexity  was  higher  in  version  2  than  in  version  1;  in  the 
third  test  case,  this  factor  was  similar  in  both  versions.  We  concluded  that  on 
the  average,  this  factor  was  similar  in  both  versions  and,  in  fact,  we  should 
observe  a  small  reduction  on  intra-group  complexity  measured  by  this  factor 
when  an  statistically  significant  population  is  used  for  the  comparison. 

•  Factor  /i2,  the  percentage  of  modules  accessing  all  global  variables,  was  not 
used  in  the  comparison  since  it  can  only  be  computed  for  those  groups  derived 
from  identified  objects  using  the  globals-based  analysis  algorithm. 

•  Factor  /i3,  the  sum  of  type  sizes  of  interior  set  variables,  included  a  test  case  in 
which  the  intra-group  complexity  was  higher  in  version  1  than  in  version  2,  and 
another  test  case  in  which  the  intra-group  complexity  was  higher  in  version  2 
than  in  version  1;  the  third  test  case  showed  slightly  reduced  intra-group  com- 
plexity in  the  identified  objects.  We  concluded  that  factor  /13  showed  slightly 
reduced  intra-group  complexity  in  the  groups  derived  from  identified  objects. 

•  Factor  /14,  the  set  of  types  of  interior  set  variables,  showed  similar  intra-group 
complexity  inside  the  groups  in  both  versions  of  the  program  as  observed  from 
the  number  of  different  types  manipulated  by  routines  in  each  version  in  all 
three  test  cases. 

The  data  from  the  three  test  cases  in  study  I  was  summarized  above,  and  shows 
some  evidence  of  the  effectiveness  of  the  object  finder  to  find  groups  that  exhibit 
desired  inter-group  and  intra-group  complexity  levels  as  compared  to  hierarchical 


109 


clustering  techniques.  It  also  shows  the  trend  of  the  results  expected  from  the  future 
evaluation  of  the  approach. 

The  data  illustrates  that  three  out  of  eight  inter-group  complexity  factors 
showed  significantly  reduced  complexity,  three  other  inter-group  complexity  factors 
showed  reduced  inter-group  complexity,  and  the  two  remaining  inter-group  complex- 
ity factors  showed  similar  inter-group  complexity.  Furthermore,  the  data  illustrates 
that  two  out  of  six  intra-group  complexity  factors  showed  slightly  reduced  intra-group 
complexity,  one  intra-group  complexity  factor  showed  similar  intra-group  complexity, 
and  the  three  other  intra-group  complexity  factors  were  not  used  in  the  study. 

The  implications  of  this  study  are  that  the  object  finder  approach  results  in 
a  modularization  of  a  program  which  exhibited  significantly  lower  inter-group  com- 
plexity and  moderately  lower  intra-group  complexity  when  compared  to  hierarchical 
clustering  techniques.  It  is  important  to  note  that  those  complexity  metrics  factors 
that  measure  the  intra-group  strength  were  not  considered  during  this  study  since 
they  measure  characteristics  of  a  program  which  are  not  considered  by  the  hierarchi- 
cal clustering  approach  [14].  Hence,  this  reduced  the  possibility  that  the  complexity 
metrics  factors  might  be  used  to  favor  the  object  finder  approach. 

In  this  interpretation  of  the  data  in  Study  I,  the  primitive  complexity  metrics 
factors  were  equally  weighted  with  respect  to  the  total  complexity  of  a  group.  During 
the  future  study,  the  factors  will  not  be  evenly  weighted  when  computing  the  total 
complexity.  In  fact,  one  of  the  proposed  future  research  consists  of  expressing  a 
mathematical  relation  between  these  complexity  metrics  factors.  In  Section  5.3,  we 
explained  the  intuitive  effect  of  each  factor  on  the  total  complexity.  The  mathematical 
relationship  expressing  the  total  complexity  will  be  obtained  by  specifying  the  weights 
of  each  complexity  metrics  factors  on  the  total  complexity. 


CHAPTER  6 
APPLICATIONS:  DESIGN  RECOVERY  BASED  ON  THE  APPROACH 

This  chapter  illustrates  the  potential  usefulness  of  our  approach  by  applying  it 
in  the  context  of  design  recovery. 

As  shown  in  Figure  6.1,  there  are  two  major  approaches  of  system  modulariza- 
tion in  the  object  finder.  The  input  in  each  case  is  a  "C"  program.  The  top-down 
approach  decomposes  a  whole  system  into  a  set  of  related  components.  The  output  of 
the  top-down  approach  is  a  collection  of  candidate  objects.  The  bottom-up  approach 
starts  from  a  user- identified  subcomponent  and  grows  up  until  all  related  subcom- 
ponents are  also  collected.  The  output  of  the  bottom-up  approach  is  an  extended 
candidate  object.  The  top-down  approach  provides  an  overall  understanding  of  the 
entire  system.  On  the  other  hand,  the  bottom-up  approach  provides  a  view  of  the 
information  which  is  closely  related  to  the  components  which  are  under  examination. 

The  process  of  object  identification  can  be  performed  top-down  which  results 
in  a  modularization  of  a  system  into  recursively  containing  objects  (the  objects  are 
themselves  modularized  into  other  set  of  objects).  The  kind  of  design  information 
obtained  by  this  process  corresponds  to  the  structure  of  the  program  in  terms  of  its 
components  and  the  relationships  between  those  components.  Next,  we  present  the 
methods  that  summarize  these  approaches  of  system  modularization. 

Method  1   Top-down  Analysis 

Input  :  A  candidate  object;  for  the  complete  system:  the  system  object,  otherwise  a 
simple  object. 

Output  :  A  collection  of  candidate  objects. 
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Figure  6.1.  The  object  finder  system  flow 
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Steps  : 

1.  Perform  either  or  both  of  Globals-based  Object  Finder  (Algorithm  1)  or 
Types-based  Object  Finder  (Algorithms  2). 

2.  Decompose  the  current  candidate  object  into  identified  candidate  objects. 

3.  Display  to  the  user  the  identified  candidate  objects. 

4-  Refine  a  selected  candidate  object  and  store  the  identified  candidate  objects 
for  future  reference. 

This  approach  is  complemented  with  heuristics  and  human  input.  These  heuris- 
tics will  be  abstracted  from  the  human  input  during  the  future  study  and  will  even- 
tually replace  the  human  input  in  the  object  finder.  The  kind  of  heuristics  in  this 
analysis  are  similar  to  those  utilized  by  a  designer  when  a  program  decomposition  is 
first  defined. 

Analysis.  The  time  complexity  associated  with  this  method  depends  on  the 
complexity  of  the  Globals-based  and  Types-based  Object  Finder  algorithms.  Each 
level  of  modularization  will  decrease  the  size  of  the  identified  objects  in  terms  of  the 
number  of  routines,  global  variables,  and  (abstract  data)  types;  then,  the  required 
time  to  analyze  an  object  will  also  decrease.  The  display  time  of  the  identified  objects 
is  proportional  to  the  number  of  identified  objects  at  each  level  of  decomposition. 

The  process  of  object  identification  could  also  be  as  follows: 

Method  2  Bottom-up  Analysis 

Input  :  An  initial  candidate  object  defined  by  the  user 

Output  :  The  extended  candidate  object. 
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Steps  : 

1.  Let  the  user  propose  a  starting  point — one  or  more  routines,  types,  or  data 
items  to  be  part  of  an  initial  candidate  object  (F,T,D). 

2.  Let  the  user  choose  if  we  will  next  apply  data-routine  analysis,  routine- 
data  analysis,  type-routine    analysis,  or  routine-type    analysis: 

(a)  If  he  chooses  data-routine  analysis:  add  to  F  the  set  of  all  routines 
that  use  data  items  in  D. 

(b)  If  he  chooses  routine-data   analysis:  add  to  D  the  set  of  all  global  and 
persistent  data  items  used  by  routines  in  F . 

(c)  If  he  chooses  type-routine  analysis:  add  to  F  the  set  of  all  routines 
that  use  types  in  T  for  formal  parameters  or  return  values. 

(d)  If  he  chooses  routine-type  analysis:  add  to  T  the  set  of  all  types  used 
as  formal  parameters  or  return  values  by  the  routines  in  F. 

3.  Draw  out  a  graph  showing  the  F ,  T  and  D  and  their  connections  based  on 
globals-based  analysis  (Algorithm  1)  and  types-based  analysis  (Algorithm  2) 
for  the  user  to  review.  He  may  delete  any  of  the  components  he  wishes 
and  either  repeat  the  previous  step  or  store  the  candidate  object  for  future 
reference. 

Analysis.  This  method  is  an  iterative  approach  to  group  the  related  compo- 
nents in  a  system  into  a  single  candidate  object.  The  initial  candidate  object  is 
determined  by  the  user.  Let  the  number  of  routines  proposed  by  the  user  at  the 
starting  point  be  r,  the  number  of  types  be  t,  and  the  number  of  global  variables  be 
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v.  The  complexity  of  the  initial  step  depends  on  the  kind  of  analysis  chosen  by  the 
user  as  follows: 

If  the  user  chooses  worst-case  time  complexity 

Data-routine  analysis  O(N.v) 

Routine-data  analysis  0(r.g) 

Type-routine  analysis  O(R.t) 

Routine-type  analysis  0(r.T) 

where  N  is  the  number  of  routines  in  the  system,  g  is  the  number  of  global  variables 
in  the  system,  and  T  is  the  number  of  (abstract  data)  types  in  the  system. 

The  next  step  consists  of  calculating  the  connections  between  components  of 
a  system  using  the  Globals-based  and  Types-based  Object  Finder  algorithms.  In 
Chapter  4,  the  bounding  time  complexities  of  these  two  algorithms  were  0(g2N2) 
and  0(max(T2,NT)),  respectively. 

Chapter  8  reports  on  the  usefulness  of  these  methods  for  recovering  the  design 
information.  Section  8.1  illustrates  the  Top-down  Analysis  method  and  Section  8.3 
illustrates  the  Bottom-up  Analysis  method. 


CHAPTER  7 
A  PROTOTYPE  FOR  THE  PROPOSED  APPROACH 

This  chapter  outlines  an  interactive  prototype  tool  which  implements  the  ap- 
proach. This  tool  was  designed  to  demonstrate  the  flexibility  and  portability  of  its 
design. 

The  object  finder  implements  both  Top-down  and  Bottom-up  Analysis  (Meth- 
ods 1  and  2)  as  well  as  the  Globals-based  and  Types-based  Object  Finder  (Algorithms 
1  and  2)  in  the  context  of  a  design  recovery  tool.  Appendix  A  presents  the  user's 
manual  of  the  object  finder  prototype. 

Top-down  Analysis  could  well  be  the  most  productive  way  to  get  a  first  cut 
at  the  object  content  of  an  entire  software  system,  especially  if  heuristics  can  be 
developed  to  speed  or  replace  human  intervention.  However,  both  algorithms  could 
also  be  used  "bottom  up"  and  together;  this  may  be  the  most  useful  way  to  organize 
an  exploratory  object  finding  tool. 

7.1     The  Object  Finder:  A  GNU  Emacs-based  Design  Recovery  Tool 

A  conceptual  model  of  the  object  finder  is  presented  in  Figure  7.1  which  shows 
the  interactions  between  the  object  finder  and  the  environment,  including  a  user,  a 
graphics  interface,  a  program  base,  an  object  base,  and  Emacs.  The  communication 
consists  of  the  command  control  flow,  issued  by  the  user,  and  data  flows  for  access 
to  the  program  and  object  bases.  The  program  to  be  analyzed  is  read  in  from 
the  program  base.  Currently,  the  prototype  analyzes  a  "C"  program  that  consists 
of  a  collection  of  source  files  and  a  make  file  which  describes  how  the  program  is 
constructed.  The  user  may  choose  to  use  Emacs  for  any  program  modification,  to 
use  the  object  finder  to  identify  high  level  objects,  or  to  use  the  graphics  interface 
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Figure  7.1.  The  object  finder  conceptual  model 
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to  display  the  objects  on  a  Sun  workstation.  The  object  base  consists  of  the  set  of 
candidate  objects  identified  by  the  object  finder. 

7.2     Design  Goals  of  the  Object  Finder 

The  object  finder  was  implemented  to  achieve  several  design  goals  including 
ease  of  use,  portability,  and  flexibility.  The  GNU  Emacs  programming  environment 
provided  a  prototyping  system  with  these  and  other  attributes  [40].  GNU  Emacs  is  an 
extensible  editor  which  derives  much  of  its  power  and  flexibility  from  its  organization. 
As  illustrated  in  Figure  7.2,  we  use  a  similar  organization  in  the  object  finder.  In 
the  figure,  the  object  finder  implementation  consists  of  the  functions  implemented  in 
GNU  Emacs  LISP,  the  keyboard  interpreter  which  provides  functionality  for  capturing 
user's  input  from  the  keyboard,  and  the  display  routines  which  present  output  to  the 
user's  terminal.  In  addition,  Emacs  interacts  with  external  processes  such  as  a  cross- 
reference  generator  and  a  graphics  user  interface. 

The  internal  structure  of  Emacs  was  used  to  organize  the  functions  of  the  object 
finder.  Processes  were  used  to  control  the  activities  of  the  object  finder,  buffers  were 
used  as  data  structures  to  store  the  information  it  generated,  and  windows  were  used 
to  display  the  text  output  on  the  user's  terminal.  The  steps  of  the  object  finder  are 
as  follows:  First,  the  object  finder  is  initialized  by  creating  the  symbol  table  of  the 
program.  Second,  the  "C"  program  is  decomposed  into  identified  objects  using  the 
proposed  algorithms,  and  the  relationships  among  objects  are  derived.  Third,  text 
output  is  generated,  stored  into  the  output  buffers,  and  displayed  in  the  windows. 
Finally,  the  identified  objects  and  the  relationships  are  graphically  displayed  on  an 
X-window  user  interface  [28]. 
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Figure  7.2.  The  object  finder  implementation  outline 
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7.3     Design  of  the  Object  Finder 

The  design  of  the  object  finder  was  developed  in  several  stages.  The  initial 
design  was  the  kernel  of  the  object  finder  containing  basic  10  functions  which  were 
separated  from  other  components  to  minimize  the  risks  of  personnel  changeovers. 

An  example  of  these  basic  10  functions  is  the  following  set  of  Emacs  LlSP  rou- 
tines in  Figure  7.3  which  implement  a  scanner  of  tokens  in  a  given  input  stream. 
The  program  under  analysis  consists  of  several  kind  of  identifiers  including  variables, 
types,  routines,  reserved  words,  and  file  names,  as  well  as  numeric  and  string  con- 
stants. In  this  example,  function  get-token  gets  the  next  available  token  from  the 
input  which  is  a  "C"  program.  If  a  token  has  been  previously  recognized  from  the 
input,  the  global  variable  token_p  is  true  and  the  current  value  of  the  global  variable 
token  is  returned;  otherwise,  a  new  token  is  scanned  from  the  input  stream.  The 
function  succ-token  looks  ahead  to  the  next  input  token.  The  function  get-tokenO 
scans  the  input  stream  for  available  tokens;  this  function  recognizes  the  identifiers 
and  operators  in  a  C  input  program.  The  first  step  in  the  functions  is  to  determine 
the  input  stream  (a  buffer)  by  using  the  function  set-buffer  to  set  the  current 
buffer  to  the  buffer  named  FROM.  The  function  more  forms  a  token  from  the  input 
characters.  The  function  member  determines  set  membership  of  its  first  parameter  in 
its  second  parameter. 

Several  other  functions  were  provided  to  support  the  previous  functions  includ- 
ing functions  to  read  a  character  from  the  input  (get-byte);  to  determine  whether 
a  character  is  an  alphabetic  character  (alpha-c),  a  digit  character  (digit-c),  an 
end-of-buffer  mark  (end-of-buff er),  or  a  blank  (including  tabs,  spaces,  and  end- 
of-lines)  (blank);  to  look  ahead  at  the  next  input  character  (succ-char);  and  to 
retrieve  the  next  character  as  a  number  (get-char).  Finally,  the  function  newlines 
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adds  a  newline  mark  to  each  file  in  the  list  of  files  FILES,  the  function  skip-line 
moves  the  start  of  next  line  in  the  buffers,  and  the  function  add-lineno2  increments 
the  line  counter  in  the  buffer  FROM. 

One  of  the  most  interesting  design  decisions  was  the  use  of  Emacs  buffers  for 
implementing  most  of  the  interfaces  among  the  components  of  the  object  finder.  A 
buffer  holds  edited  data  in  Emacs  and  could  be  associated  with  one  or  more  windows. 
Buffers  are  used  as  an  interface  to  store  the  program  under  analysis,  the  set  of  iden- 
tified candidate  objects,  the  data  generated  during  the  analysis,  and  the  information 
to  be  displayed.  In  Figure  7.1,  the  interface  between  the  program  base  and  the  ob- 
ject finder  was  implemented  in  terms  of  buffers  as  well  as  the  interfaces  between  the 
object  base  and  the  object  finder,  and  between  the  object  base  and  the  graphics  user 
interface. 

Another  design  decision  Was  the  use  of  processes  to  handle  several  external 
functions.  Processes  are  primarily  used  to  interact  with  external  programs  that  gen- 
erate the  program  symbol  table  and  the  cross  reference  information,  or  to  provide 
the  graphics  interface. 

For  example,  a  process  is  used  to  link  to  the  ANSI  "C"  cross-reference  tool. 
Two  functions  are  used  to  control  this  process;  Figure  7.4  shows  part  of  the  process- 
control  code  in  one  of  the  functions.  A  process  communicates  with  the  object  finder 
through  buffers.  Function  start-process  starts  a  program  and  returns  the  process 
identifier  [20].  The  input  buffer,  program-file,  to  this  process  consists  of  the  "C" 
source  code  for  which  cross-reference  information  is  needed  and  the  output  buffer, 
cross-ref-set,  will  contain  this  information.  The  external  program  invoked  by  this 
process  is  called  acx.  Finally,  the  data-file-name-directory  specifies  the  data 
directory  where  the  source  files  are  located. 
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(defun  get-token  (from  feoptional  If  files) 

"Get  next  token  from  buffer  FROM.  LF  is  non  nil  if  the  newline  mark 
is  treated  as  a  token.  FILES  is  a  list  of  files  associated  with 
the  input  stream." 
(set-buffer  from) 
(let  ((s  (get-tokenO  from  If  files))) 

(cond  ((equal  s  ''/*'')    (more  'comment  from) 

(get -token  from  If  files)) 
(t  s)))) 

(defun  get-tokenO  (from  If  files) 

"Get  next  token  from  buffer  FROM;  move  point  too.  LF=t  if  a  newline 
mark  is  considered  as  a  token." 
(set-buffer  from) 

(cond  (token_p  (setq  token_p  nil)  token) 
((end-of -buffer  from)  nil) 
((and  If  (=  10  (succ-char  from)))  (get-byte  from) 

(add-lineno2  from)  (newlines  files)  (get-tokenO  from  If  files)) 
((blank  (succ-char  from))  (get-char  from) 

(get-tokenO  from  If  files)) 
((=  (succ-char  from)  10)  (get-char  from)  (add-lineno2  from) 

(get-tokenO  from  If  files)) 
((alpha  (succ-char  from))  (more  'alphas  from)) 
((digit  (succ-char  from))  (more  'digits  from)) 
((member  (succ-char  from)  ' (?=  ?*  ?%   ?"  ?!))  (more  '=  from)) 
((member  (succ-char  from)  '(?-))  (more  '-  from))  ;"-| — |-=|->" 
For  simplicity,  only  operators  "-",  " — ",  "-■",  and  "->"  are 
included.  Other  operators  such  as  "/"  "/="  "/*"  "+"  "++" 

"+="  "Xr"     "&&"  "&="  "I"  "II"  "|="  "<"  "<="  "<<"  "<<  =  "  "s"  ">=" 

..»,.  n»«ii  iiji  "...",  can  be  handled  in  a  similar  fashion, 
(t  (get -byte  from)))) 

(defun  succ-token  (from  feoptional  If  files) 
"Look  ahead  the  next  token  in  buffer  FROM." 
(set -buffer  from) 
(cond  (token.p  token) 

(t  (setq  token  (get -token  from  If  files))  (setq  token_p  t) 

token))) 


Figure  7.3.  A  scanner  of  tokens 


(start -process   "process-name"   cross-ref-set   "acx" 
"-c"    (concat  data-file-name-directory  program-file)) 


Figure  7.4.  Process  to  control  the  ANSI  "C"  cross-reference  tool  acx 
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Figure  7.5.  Xobject  commands 


Emacs  buffers  hold  processed  data,  including  the  symbol  table,  the  identified 
candidate  objects,  the  graphs  and  matrices  used  during  the  analysis  of  a  program, 
and  the  display  information.  The  symbol  table  is  implemented  as  a  set  of  unique 
ID's,  one  for  each  identifier  in  the  program.  The  property  lists  of  GNU  Emacs  LlSP 
were  used  to  describe  the  characteristics  of  identifiers. 

User  interfaces  have  been  recognized  as  one  the  most  critical  parts  of  a  system. 
The  user  interface  of  the  object  finder  consists  of  menu-driven  as  well  as  graphics 
components.  The  input  of  the  menu-driven  interface  was  implemented  by  using  an 
Emacs  minibuffer.  Output  information  is  displayed  in  a  pre-set  window  layout  which 
may  be  adjusted  by  the  user.  The  graphics  user  interface  provides  methods  to  display 
and  modify  the  identified  objects  and  their  relationships.  This  interface  is  explained 
in  the  following  section. 

7.4     Xobject:  Graphical  User  Interface 

Xobject  is  invoked  by  the  user  in  the  object  finder.  It  allows  a  user  to  render 
a  graphics  representation  of  the  identified  objects,  modify  the  objects,  and  save  the 
modified  objects  for  later  use  by  the  object  finder.  The  first  step  in  xobject  is  to 
load  the  candidate  objects.    The  top  part  of  xobj  ect  display  consists  of  the  menu 
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which  is  listed  in  Figure  7.5. 


DIR 


allows  you  to  set  the  working  directory.    LOAD 


loads  a  file  to  be  analyzed.    After  a 


LOAD 


,  the  identified  objects,  along  with  their 


relationships,  are  shown.  |  MOVE  [  allows  you  to  modify  the  arrangement  of  the  layout 
of  the  objects. 

Other  xobject  commands  include  to  temporarily  hide  an  object  and  all  its 
relationships,  to  permanently  remove  (  DESTROY 


relationships  between  objects  (|  RELATE  |),  to  save  modifications  ( 


)  an  object  or  relationships,  to  add 

),  to  recreate 


SAVE 


the  relations  between  existing  objects  (|  REHASH  |),  and  to  restore  the  original  data 


(|  RESTORE  |). 


CHAPTER  8 
EXPERIENCE  WITH  THE  OBJECT  FINDER 


This  chapter  presents  our  experience  in  using  the  object  finder  to  analyze  pro- 
grams. Our  objective  was  to  recover  the  design  of  programs  by  modularization  using 
the  algorithms  outlined  above. 

An  example  used  to  illustrate  the  top-down  analysis  is  the  name  cross-reference 
program.  Another  example  of  the  top-down  analysis  is  a  simple  algebraic  expression 
evaluation  program  which  was  translated  by  hand  from  a  C++  program  with  about 
1,350  lines  of  code  and  50  functions.  Finally,  an  example  of  the  bottom-up  analysis 
is  given  using  the  name  cross-reference  program. 

8.1     Example  of  the  Top-down  Analysis:  Name  Cross-reference  Program 

This  section  illustrates  how  to  use  the  object  finder  in  a  top-down  analysis 
of  a  system,  its  graphics  user  interface  xobject,  and  some  of  the  design  recovery 
capabilities  of  the  object  finder. 

The  following  example  [26]  is  a  name  cross-reference  program  for  "C"  programs 
whose  size  statistics  are  given  in  Table  8.1.    The  sample  program  consists  of  five 

Table  8.1.  Statistics  of  the  name  cross-reference  program. 


xref.c 

xreLin.c 

xref_out.c 

xreLtab.c 

xref_tab.h 

Total 

Lines  of  code 

27 

56 

41 

126 

32 

282 

Global  variables 

0 

1 

0 

0 

0 

1 

Functions 

1 

2 

1 

6 

0 

10 

Types  (user-defined) 

1 

0 

0 

5 

4 

10 
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source  files  including  one  C  header  file  and  four  program  files.  As  explained  above, 
there  is  a  make  file  that  contains  a  list  of  these  files. 

The  top-down  analysis  approach  was  used  with  both  the  globals-based  and 
types-based  analyses;  we  choose  to  ignore  "C"  primitive  types  during  the  types-based 
analysis.  The  identified  objects  are  shown  in  Figure  8.1.  In  Figure  8.1,  a  unique  ID 
has  been  assigned  to  each  identifier  by  the  system  automatically.  For  example,  unique 
ID  xref  _in~getachar~ch~58  (not  shown  in  Figure  8.1)  refers  to  variable  ch  declared 
in  function  getachar  which  is  located  in  file  xref  _in.  In  the  case  of  a  "C"  type,  the 
corresponding  unique  ID  has  empty  function  and  file  components;  for  a  type  denoting 
a  "C"  array,  the  corresponding  unique  ID  includes  the  type  of  the  array  elements 
followed  by  its  dimension  information,  e.g.,  ""char — ~49  denotes  a  one-dimensional 

array  of  characters  with  unspecified  size.   In  addition,  the  special  "function" II 

denotes  the  scope  where  external  "C"  functions  and  variables  are  defined.  Finally, 
the  unique  IDs  are  accompanied  by  a  code  which  denotes  the  kind  of  identifier  (R  for 
routines,  T  for  types,  and  G  for  global  data)  and  a  copy  of  the  identifier  name;  for 
user-defined  types,  additional  information  about  the  type  is  given  such  as  the  number 
of  elements  in  each  dimension  for  an  array. 

The  name  shown  for  each  candidate  object  consists  of  a  prefix  z#obj ,  a  code  G 
for  globals-based  objects  or  T  for  types-based  objects,  the  name  of  a  global  variable 
or  type  associated  with  the  object,  and  a  suffix  consisting  of  #  and  a  unique  tag 
number.  A  candidate  object  is  described  by  listing  the  set  of  routines,  types,  and 
global  variables  associated  with  the  object  during  the  analysis. 

The  identified  objects  are  shown  in  Figure  8.2  as  they  appear  using  the  graphics 
user  interface  xobject.  At  the  top  of  Figure  8.2,  there  are  a  series  of  buttons  which 
show  the  capabilities  of  this  interface.  There  are  buttons  to  set  the  working  directory, 
to  load  the  candidate  objects  and  display  their  relationships,  and  to  modify  the 
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Obj 

} 
Obj 


} 
Obj 

} 

Obj 


> 

Obj 


} 
Obj 


ect  z#objTLINEPTR#3  is  { 

~~LINEPTR~16 

xref _tab~ II~make_linenode~138 

ect  z#objTWTPTR*#4  is  { 
~~WTPTR*~104 

xref _tab~ II~make_wordnode~140 

xref_tab~ II~init_tab~102 

xref _tab~ II~addword~64 

xref _tab~ II~add_lineno~61 

xref _out~ II~writewords~181 

ect  z#objTchar*#5  is  { 

~~char*~56 

xref _tab~ II~strsave~164 

ect  z#objTint*#6  is  { 
~~int*~69 

xref" II~main~137 

xref _in~ II~getachar~93 

ect  z#objT+undetermined-type#7  is  { 

~~int*~69 

~~char~~49 

xref _in~ II~getword~98 

ect  z#objGlineno#2  is  { 

xref _in~ II~lineno~130 

xref _in~ II~getword~98 

xref _in~ II~getachar~93 


(T)LINEPTR 
(R)make_linenode 


(T)WTPTR* 

(R) raake_wordnode 

(R)init_tab 

(R) addword 

(R)add_lineno 

(R)writewords 


(T)char* 
(R)strsave 


(Dint* 
(R)main 
(R)getachar 


(T)int* 

(T)char[] 

(R)getword 


meno 


(G)l 

(R)getword 

(R)getachar 


Figure  8.1.  Candidate  objects  identified  in  the  name  cross-reference  program 
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arrangement  of  the  layout  of  the  objects.  The  components  of  each  identified  object 
(data  items,  data  types,  and  functions)  are  grouped  in  a  light  rectangle.  The  dark 
rectangles  represent  common  components  between  two  objects. 

A  user  may  modify  the  components  of  each  object  by  changing  the  name,  adding 
or  deleting  components,  or  moving  components  among  objects  (see  Section  7.4  for 
a  complete  explanation  of  xobject  capabilities).  An  example  of  the  kind  of  mod- 
ifications that  can  be  made  to  the  identified  objects  is  to  move  a  component  from 
a  candidate  object  and  add  it  to  another  object  such  as  removing  routine  getword 
from  object  T+undetermined-type#7  and  routine  getachar  from  object  Tint*#6. 
After  such  a  modification,  the  user  will  recreate  the  relationships  between  the  ex- 
isting objects.  The  result  is  that  the  relationships  between  object  Glineno#2  and 
objects  Tint*#6  and  T+undetermined-type#7,  in  Figure  8.2,  will  disappear  because 
the  common  components  between  them  have  been  eliminated. 

The  design  recovery  capabilities  of  the  object  finder  can  be  appreciated  in  this 
example.  The  output  of  the  object  finder  should  be  useful  to  the  maintenance  pro- 
grammer since  it  provides  high  level  information  about  the  structure  of  the  program. 
In  this  example,  the  object  finder  was  able  to  identify  key  aspects  of  the  original  design 
of  the  cross-reference  program  by  suggesting  a  decomposition  of  it  into  objects  and 
graphically  presenting  the  relationships  among  the  objects.  For  example,  the  struc- 
ture of  the  name  cross-reference  program  in  Figure  8.2  consists  of  four  clusters1  of  can- 
didate objects,  the  first  consisting  of  objects  T+undetermined-type#7,  Tint*#6,  and 
Glineno#2,  the  second  consisting  of  TWTPTR*#4,  the  third  consisting  of  TLINEPTR*#3, 
and  the  fourth  consisting  of  Tchar*#5. 

After  viewing  this  suggested  initial  structure,  the  user  is  able  to  interactively 
modify  the  objects  and  their  relationships.    As  explained  above,  two  components 


1These  are  different  from  Hutchens  and  Basili's  hierarchical  clusters. 
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DIR  EXECUTE  LOAD  RESTORE  SAVE  RELATE  DESTROY  HOVE  REHASH  QUIT 


TWTPTR*#4 


z#objTWTPTR*i4 


xref_put- 

xref_tab- 

xref_tab- 

xref_tab- 

xref_tab-_ 

~~WTPTR*~104 


_II~writewords~181 
_I  I  ~  add_l  ineno-  6 1 
II~addword~64 
[ll~init_tab~102 
_II~make_wordnode~| 


z#objTLINEPTR#3 


TLINEPTR#3 


xref_tab~ II~make_linenode~l 

~~LINEPTR~16 


lxref_in~ II~getword~98 


Gl ineno #2 


z#objGlinenol2 


xref_in~ 

xref_in~_ 

xref_in~ II~getword~98 


_II~lineno~130 
II~getachar~93 


! 


T 


xref_in~ I I~getachar~93 


Tchar* 15 


z#objTchar*#5 


xref_tab~ II~strsave~164 

~~char*~56 


T+undet  ermined-type  tt  7 


2 #objT+undetermined-type#7 


xref_in~ II~getword~98 

~~char 49 

~~int*~69 


•int*~69 


Tint*  #6 


z lob j Tint* #6 


Y 


xref_in~ II~getachar~93 

xref- II~main~137 

~~int*~69 


Figure  8.2.    Candidate  objects  identified  in  the  name  cross-reference  program  dis- 
played by  xob j  ect 


Table  8.2.  Statistics  of  the  algebraic  expression  evaluation  program. 
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exprtst.c 

function. c 

symbol,  c 

state.c 

function.h 

symbol,  h 

state.h 

Total 

Lines  of  code 

88 

759 

75 

147 

140 

33 

82 

1324 

Global  variables 

0 

0 

2 

0 

5 

0 

3 

10 

Functions 

1 

28 

7 

14 

0 

0 

0 

50 

Types  (user-defined) 

3 

6 

2 

0 

1 

1 

1 

14 

were  removed  from  two  different  candidate  objects  to  provide  a  clearer  design.  The 
resulting  objects  after  the  modifications  are  shown  in  Figure  8.3. 

In  addition,  the  programmer  is  able  to  manually  cluster  related  objects  into  a 
larger  object.  After  the  modifications  concerning  the  removal  of  routines  getword  and 
getachar  from  objects  T+undetermined-type#7  and  Tint*#6,  respectively,  we  define 
five  clusters  of  candidate  objects  in  the  cross-referencer  program.  The  largest  cluster, 
used  for  input  and  output  in  this  program,  consists  of  T+undetermined-type#7  and 
Tint*#6.  Each  of  the  objects  TWTPTR*#4,  TLINEPTR*#3,  and  Glineno#2,  constitutes 
a  cluster  of  objects  of  size  one  and  consists  of  an  object  that  implements  a  given 
abstract  data  type.  Finally,  the  object  Tchar*#5  constitutes  a  cluster  of  objects  of 
size  one  which  denotes  an  object  that  performs  utility  functions.  The  next  step  of 
refinement  would  disregard  this  routine  from  consideration  by  disabling  it  for  the  rest 
of  the  analysis. 

8.2     Comparison  with  C++  Classes:  Algebraic  Expression  Evaluation  Program 

This  section  presents  the  results  of  study  II  mentioned  in  Chapter  5.  In  this  sec- 
tion, a  "C"  program  which  is  translated  by  hand  from  a  "C++"  program  [44]  is  used 
to  illustrate  further  design  recovery  capabilities  of  the  object  finder.  The  statistics  of 
the  translated  version  in  "C"  are  listed  in  Table  8.2.  The  top-down  analysis  approach 
was  executed,  based  on  both  globals-based  and  types-based  analysis  methods.  Once 
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Figure  8.3.  Modified  candidate  objects  in  the  name  cross-reference  program  displayed 
by  xob j  ect 
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Table  8.3.  Comparison  of  candidate  objects  and  "C++"  classes. 


Object  in  "C" 

"C++"  class 

Explanation 

G:  table_index#2 

symbol_table 

All  data  and  routines 
are  matched. 

G:  f irst#5 

function 

Almost  all  data  and  routines 
are  matched  except 
one  routine. 

T:  N0DE*#6 

node 

The  C++  classes  variable, 

variable 

plus,  divide,  constant, 

plus 

multiply,  and  minus 

divide 

are  subclasses  of  the  C++ 

constant 

class  node. 

multiply 

minus 

G:  expression#4 

stateO 

The  class  stateO  is  divided 

G:  paren_count#3 

into  two  candidate  objects. 

T:  char*#7 

symbol_table 

This  part  was  identified 

function 

under  user's  request. 

stateO 

again,  we  choose  to  ignore  the  primitive  types  during  the  types-based  analysis.  The 
identified  candidate  objects  are  displayed  in  Figures  8.4  and  8.5. 

In  Table  8.3,  we  compared  the  identified  candidate  objects  with  the  original 
"C++"  classes. From  Table  8.3,  we  observed  that  the  "C++"  class  symbol_table 
perfectly  matches  the  candidate  object  Gtable_index#2,  and  that  the  "C++"  class 
function  almost  perfectly  matches  the  candidate  object  Gf  irst#5  in  terms  of  the  set 
of  routines  in  those  candidate  objects.  Some  differences  between  the  derived  objects 
and  the  original  "C++"  classes  are  explained  below. 

Object  TN0DE*#6  is  different  from  the  "C++"  classes  listed  in  the  third  row  of 
Table  8.3  because  node  is  a  friend  class  to  the  other  classes  in  the  "CH — h"  version  of 
the  program  and  the  translated  "C"  code  merged  those  classes  together  as  one  com- 
bined type.  In  addition,  Gexpression#4  and  Gparen_count#3  are  combined  as  one 
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Object  z#objTN0DE*#6  is  { 
~~N0DE*~15 

function" II~Variable_eval~140 

function" II~Plus_eval~104 

function" II~Node_eval~101 

function" II~Multiply_eval~96 

function" II~Minus_eval~94 

function" II~Function_precedence~82 

function" II ~Function_delete_f unction 

function" II~Function_checkandadd~68 

function" II~Function_build_tree~66 

function" II~Function_add_operator~62 

function" II~Function_add_operands~60 

function" II~Echo_treeO"29 

function" II~Divide_eval~22 

function" II "Const ruct_f unction" 16 

function" II~Constant_eval~12 

} 

Object  z#objTchar*#7  is  { 
~~char*~18 

symbol" II"Symbol_table_get_index~136 

symbol" II~Symbol_table_add_variable~ 

state" II~State9_transition"128 

state" II~State7_transition~126 

state" II~State6_transition~124 

state" II~State5_transition~122 

state" II~State4_transition~120 

state" II~State3_transition"118 

state" II~State2_transition~116 

state" II~Statel_transition~114 

state" II~StateO_transition~112 

state" II~StateO_get_char~109 

function" II~Function_valid~85 

function" II~Function_parenthesis~80 

function" II~Echo  tree~27 

} 


T)N0DE* 

R)Variable_eval 

R)Plus_eval 

R)Node_eval 

R)Multiply_eval 

R)Minus_eval 

R) Funct  ion_precedence 

R)Function_delete_f unction 

R) Funct ion_checkandadd 

R) Funct ion_build_tree 

R) Funct ion_add_operator 

R) Funct ion_add_operands 

R)Echo_treeO 

R)Divide_eval 

R)Construct_f unction 

R)Constant_eval 


T) char* 

R) Symbol_table_get_index 

R)Symbol_table_add_variable 

R)State9_transition 

R)State7_transition 

R) State6_transition 

R) State5_transition 

R)State4_transition 

R) State3_transition 

R)State2_transition 

R)Statel_transition 

R)StateO_transition 

R)StateO_get_char 

R) Funct ion_val id 

R)Function_parenthesis 

R)Echo_tree 


Figure  8.4.    Types-based  candidate  objects  identified  in  the  algebraic  expression  e- 
valuation  program 
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Object  z#objGtable_index#2  is  { 

symbol" II~table~284 

symbol" II~table_index"286 

symbol" II~Symbol_table_get_value~138 

symbol" II~Symbol_table_get_index~136 

symbol" II~Symbol_table_clear~135 

symbol" II~Symbol_table_add_variable~ 

symbol" II~Symbol_table_add_value~130 

symbol" II~Echo_symbol_table~26 

symbol" I I "Construct _symbol_t able" 20 


} 
Obj 


} 
Obj 


} 

Obj 


ect  z#objGparen_count#3  is  { 

state" II~paren_count~254 

state" II~StateO_inc_paren~lll 

state" II~State0_get_paren~110 

state" II~StateO_dec_paren~108 

ect  z#objGexpression#4  is  { 

state" II~index~213 

state" II "express ion" 178 

state" II~StateO_get_char~109 

state" II~Construct_stateO~19 

function" II~Function_valid~85 

ect  z#objGf irst#5  is  { 

function" II~total_paren~292 

function" II "root "267 

function" II~queue~263 

function" II~last~216 

function" II "first "193 

function" II"Function_precedence~82 

function" II~Function_parenthesis~80 

function" II~Function_ovldop~73 

function" II ~Function_delete_f unction 

f unct  i  on" II "Funct  ion_checkandadd~  68 

function" II "Funct ion_build_tree~66 

function" II "Funct ion_add_operator~62 

function" II~Function_add_operands~60 

function" II~Echo_treeO~29 

function" II~Echo_queue"25 

function"   _II~Construct  function" 16 


(G)table 

(G) table_index 

(R) Symbol_table_get_value 

(R) Symbol_table_get_index 

(R) Symbol_table_clear 

(R) Symbol_table_add_variable 

(R) Symbol_table_add_value 

(R)Echo_symbol_table 

(R)Construct_symbol_table 


(G) paren_count 
(R) StateO_inc_paren 
(R) StateO_get_paren 
(R) StateO_dec_paren 


(G) index 

(G) expression 

(R) StateO_get_char 

(R)Construct_stateO 

(R) Funct ion_val id 


(G)total_paren 

(G)root 

(G) queue 

(G)last 

(G)first 

(R) Funct ion_precedence 

(R) Funct ion_parenthes is 

(R) Funct ion_ovldop 

(R) Funct  ion_delete_f unct  ion 

(R) Funct ion_checkandadd 

(R)  Funct  ion_build_tree 

(R) Funct ion_add_operator 

(R) Funct ion_add_operands 

(R)Echo_treeO 

(R) Echo_queue 

(R) Construct_f unct  ion 


Figure  8.5.    Globals-based  candidate  objects  identified  in  the  algebraic  expression 
evaluation  program 
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group  stateO  in  the  original  "C++"  code.  The  reason  is  that  paren_count  is  used 
as  a  counter  of  the  variable  expression.  A  semantic  analysis  is  necessary  to  provide 
such  a  connection,  which  is  not  currently  supported  by  the  object  finder.  This  will  be 
explained  for  the  future  work  of  this  dissertation.  Finally,  object  Tchar*#7  consists 
of  routines  corresponding  to  several  "C++"  classes  routines.  Those  routines  which  o- 
riginally  belonged  in  the  "C++"  classes  symbol_table  and  function  were  removed, 
using  the  graphics  user  interface  xobject,  from  object  Tchar*#7,  and  added  to  ob- 
jects Gtable_index#2,  Gfirst#5,  and  Gexpression#4.  After  the  modifications,  the 
resulting  object  Tchar*#7  is  part  of  the  cluster  of  objects  corresponding  to  the  "C++" 
class  stateO  which  already  includes  objects  Gexpression#4  and  Gparen_count#3. 

We  have  compared  the  objects  found  by  the  object  finder  with  those  (classes) 
defined  in  the  original  "C++"  program.  There  was  great  similarity  between  the  two, 
demonstrating  the  design  recovery  capabilities  of  the  object  finder.  We  noted  that 
object  TN0DE*#6  included  the  parent  class  node  and  its  subclasses  variable,  plus, 
divide,  constant,  multiply,  and  minus  from  the  original  "C++"  program. 

Figure  8.6  illustrates  the  graphics  representation  of  the  objects  modified  as 
explained  above.  Object  TN0DE*#6  is  related  to  candidate  object  Gf  irst#5  because 
of  the  commonality  between  routines  in  Gf  irst#5  (that  update  or  reference  their 
global  variables)  and  routines  in  object  TN0DE*#6  (that  manipulate  their  type,  NODE). 
We  argue  that  this  cluster  is  a  "natural  cluster"  for  two  reasons:  firstly,  except  for  two 
routines  added  by  object  Gf  irst#5  to  the  cluster,  the  same  set  of  routines  appears  in 
both  objects,  and  secondly,  the  global  variables  (total_paren,  root,  queue,  last, 
and  first)  in  object  Gf  irst#5  support  the  manipulation  of  entities  of  type  NODE  in 
the  program. 
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II~first~193 
.II~last~216 

II~queue~263 
JI~root~267 
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Figure  8.6.   Candidate  objects  identified  in  algebraic  expression  evaluation  program 
displayed  by  xob  j  ect 
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In  addition,  objects  Tchar*#7  and  Gexpression#4  are  related  because  of  the 
common  routine  StateO_get_char  in  both  objects.  This  cluster  indicates  the  exis- 
tence of  a  class  stateO  to  the  user.  Object  Gparen_count#3  is  not  part  of  this  cluster 
even  though  it  is  part  of  the  cluster  related  to  class  stateO  in  Table  8.3.  Thus,  the 
object  finder  signals  the  specialized  nature  of  object  Gparen_count#3  which  consti- 
tutes a  cluster  of  objects  of  size  one.  Finally,  the  object  Gtable_index#2  constitutes 
a  cluster  of  size  one  which  denotes  an  object  that  performs  symbol  table  management 
functions. 

8.3     Example  of  the  Bottom-up  Analysis:  Name  Cross-reference  Program 

This  section  illustrates  the  use  of  the  object  finder  in  a  bottom-up  analysis  of 
a  system  and  some  of  the  design  recovery  capabilities  of  the  object  finder.  For  this 
illustration  we  use  the  example  from  Section  8.1. 

The  bottom-up  analysis  approach  was  used  with  both  globals-based  and  types- 
based  analyses;  we  choose  to  ignore  "C"  primitive  types  during  the  types-based  anal- 
ysis. In  this  particular  example,  the  initial  goal  of  the  user  was  to  identify  the 
candidate  object  that  consists  of  the  set  of  routines  and  data  types  which  reference  a 
global  variable  in  the  program;  specifically,  global  variable  lineno  whose  unique  ID 
is  xref_in~ II~lineno~130. 

The  process  of  bottom-up  analysis  begins  with  the  user  proposing  a  starting 
point:  an  initial  candidate  object  which  consists  of  several  routines,  types,  or  data 
items.  The  initial  candidate  object  in  this  example  consists  of  the  global  variable 
lineno,  the  set  of  routines  which  the  user  considers  are  suspect  of  manipulating  the 
global  variable  lineno  (initially  only  routine  main  is  suspect),  and  the  set  of  types 
of  the  return  values  and  parameters  of  the  suspect  routines.  Figure  8.7  shows  this 
initial  candidate  object. 
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Object  z#objmkf ile#l  is  { 

xref _in~ II~lineno~130 

~~char*~56 

xref" II~main~137 

} 


(G)lineno 

(T)char* 

(R)main 


Figure  8.7.   Initial  candidate  object  defined  by  the  user  in  the  name  cross  reference 
program 


Object  z#objmkf ile#l   is  { 

xref _in~ II~lineno~130 

~~char*~56 

xref" I I "main" 137 

xref _in~ II~getword~98 

xref _in~ II~getachar~93 

} 


(G)lineno 

(T)char* 

(R)main 

(R)getword 

(R)getachar 


Figure  8.8.  Extended  candidate  object  resulting  from  the  data-routine  analysis 

The  next  step  of  the  process  is  that  the  user  specifies  the  kind  of  refinement 
analysis  of  the  initial  candidate  object.  At  first,  the  user  chooses  the  data-routine 
analysis  which  consists  of  adding  the  set  of  all  routines  that  use  the  global  variable  to 
the  current  set  of  routines  in  the  object.  After  the  analysis  is  automatically  performed 
by  the  object  finder,  the  extended  candidate  object  is  shown  in  Figure  8.8. 

Figure  8.8  shows  the  extended  candidate  object.  From  this  candidate  object 
the  user  can  determine  that  routines  get  word  and  getachar  also  use  the  specified 
global  variable.  The  user  will  benefit  from  this  knowledge  by  better  understanding 
the  set  of  routines  which  should  be  considered  when  making  a  change  that  affects  the 
global  variable. 

Next,  the  user  decide  to  perform  two  more  iterations  of  the  bottom-up  analysis 
to  build  up  his  understanding  about  the  affected  routines  when  a  change  is  made 
on  global  variable  lineno  and  the  data  type  "pointer  to  char."   The  first  iteration 
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Object  z#objmkf ile#l  is  { 

xref _in~ II~lineno~130 

~~int~4 
~~int*~69 

char — "49 
~~char*~56 
~~WTPTR~26 
~~LINEPTR~16 

xref" II~main"137 

xref _tab~ II~strsave~164 

xref _tab~ II~make_wordnode~140 

xref _tab~ II~make_linenode~138 

xref _tab~ II~addword~64 

xref _tab~ II~add_lineno~61 

xref _out~ II~writewords~181 

xref _in~ II~getword~98 

xref _in" II~getachar~93 


(G)lineno 

(T) int 

(T) int* 

(T)char[] 

(T)char* 

(T)WTPTR 

(T)LINEPTR 

(R)main 

(R)strsave 

(R)make_wordnode 

(R)make_linenode 

(R)addword 

(R)add_lineno 

(R)writewords 

(R)getword 

(R)getachar 


Figure  8.9.  Final  candidate  object 

consists  of  a  type-routine  analysis  which  involves  adding  to  the  current  set  of  routines 
in  the  object,  the  set  of  all  routines  that  use  the  types  in  the  object  as  formal 
parameters  or  return  values.  The  second  iteration  consists  of  a  routine-type  analysis 
which  adds  to  the  current  set  of  types  in  the  object,  the  set  of  all  types  used  as  formal 
parameters  or  return  values  by  the  routines  in  the  object  after  the  last  analysis.  The 
resulting  candidate  object  is  shown  in  Figure  8.9. 

Figure  8.9  shows  the  extended  candidate  object  consisting  of  routines  affected 
by  changes  to  global  variable  lineno  and  those  routines  manipulating  an  abstract 
data  type  with  type  pointer  to  char. 

Some  of  the  design  recovery  capabilities  of  the  object  finder  are  observed  from 
this  example.  The  user  begins  with  a  starting  point,  an  initial  candidate  object,  and 
builds  an  agglomerative  understanding  of  the  dependencies  between  the  system  com- 
ponents by  adding  routines,  types,  or  data  to  this  initial  candidate  object  according 
to  the  analysis  performed.  During  the  analysis,  the  user  is  aided  by  the  connections 
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between  components  which  are  drawn  at  each  iteration  of  the  analysis.  If  the  user 
desires,  the  result  shown  in  Figure  8.9  can  be  further  analyzed  using  the  top-down 
analysis. 


CHAPTER  9 
CONCLUSIONS  AND  FURTHER  STUDY 

Identifying  object-like  features  in  code  written  in  conventional  programming 
languages  would  seem  to  be  a  useful  step  in  applying  object-oriented  methods  to  the 
reengineering  and  maintenance  of  old  code.  This  dissertation  has  described  a  strategy 
based  on  persistent  data  and  data  type  for  recovering  the  underlying  system  structure 
of  an  existing  program  by  identifying  the  data,  structures,  and  their  relationships. 
The  approach's  output  is  a  modularization  of  a  program  that  consists  of  the  collection 
of  "objects"  found  in  the  program. 

We  believe  that  tools  like  the  object  finder  can  be  a  very  useful  aid  in  maintain- 
ing and  reusing  old  code.  Useful  existing  programs  may  be  discarded  because  there  is 
no  easy  way  of  understanding  their  structure.  But  if  the  original  programmer  made 
good  use  of  modularity  and  abstract  data  typing  techniques,  tools  like  the  object 
finder  can  help  to  recover  these  aspects  of  the  design  and  thus  extend  the  life  and 
value  of  the  code. 

The  object  finder  is  somewhat  analogous  to  other  software  clustering  methods 
that  have  been  proposed  [14,  36]  but  we  believe  that  it  is  unique  in  searching  for 
a  particular  kind  of  cluster,  one  that  is  similar  to  an  abstract  data  type  or  to  an 
object  [5].  It  represents  a  narrow  but  useful  kind  of  software  design  recovery,  unlike 
many  of  the  more  sophisticated  and  much  more  complicated  general  methodologies 
that  have  been  proposed.  While  the  object  finder  is  obviously  only  a  beginning, 
we  believe  that  it  illustrates  the  potential  for  the  development  of  many  kinds  of 
interactive  design  recovery  tools. 
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This  dissertation  has  also  presented  the  experience  of  using  the  object  finder 
algorithms  and  an  evaluation  of  the  analysis  by  the  object  finder.  This  evaluation 
included  an  examination  of  the  results  of  this  approach  of  system  modularization. 
Two  studies  were  used  to  evaluate  the  identified  objects.  In  study  I,  we  compared 
the  groups  identified  using  the  object  finder  (candidate  objects)  with  those  groups 
identified  using  Hutchens  and  Basili's  hierarchical  clustering  [14]  (hierarchical  clus- 
ters). The  comparison  was  based  on  the  complexity  of  the  two  partitionings  resulting 
from  each  clustering  technique.  The  results  of  the  evaluation  included  a  new  sets  of 
complexity  metrics  factors:  the  inter-group  complexity  metrics  factors,  that  measured 
the  complexity  of  the  interface  of  a  group,  the  intra-group  complexity  metrics  factors, 
that  measured  the  internal  complexity  of  a  group,  and  the  intra-group  strength  fac- 
tors, that  measured  the  "functional  relatedness"  of  the  types  of  formal  parameters  or 
return  values  in  the  routines  in  a  group.  These  factors  measured  the  complexity  (sim- 
ilar to  coupling  and  cohesion)  of  a  given  group  in  a  partitioning.  The  newly  defined 
metrics  factors  were  validated  by  showing  that  the  complexity  metrics  factors  are 
sensitive  to  changes  in  the  structure  of  programs.  Three  "C"  programs  were  used  as 
test  cases  for  this  study.  The  conclusions  of  this  study  indicated  that  the  object  finder 
approach  results  in  a  modularization  of  a  program  which  exhibited  significantly  less 
inter-group  complexity  and  moderately  less  intra-group  complexity,  when  compared 
to  the  modularization  using  hierarchical  clustering  techniques.  The  comparison  did 
not  include  intra-group  strength  factors  since  hierarchical  clustering  does  not  consider 
the  characteristics  measured  by  these  factors  during  the  modularization. 

In  study  II,  we  compared  the  identified  objects  in  a  program  with  the  classes 
found  in  the  object-oriented  version  of  the  same  program.  To  ensure  that  the  program 
versions  were  functionally  equivalent,  an  object-oriented  program  (written  in  "C++" ) 
was  translated  into  an  equivalent  non  object-oriented  version  of  the  program.   The 


142 


results  of  study  II  indicated  that  the  identified  objects  by  the  object  finder  closely 
match  the  corresponding  object-oriented  language  classes.  For  instance,  there  was  a 
very  close  match  between  the  identified  objects  using  the  Globals-based  Object  Finder 
algorithm  and  the  "C++"  classes.  Furthermore,  the  identified  objects  illustrated 
the  design  recovery  capabilities  of  the  object  finder  by  providing  a  greater  level  of 
refinement  of  the  groups  in  a  program  than  the  classes  found  in  the  "C++"  program. 
For  example,  when  several  identified  objects  correspond  to  a  single  "C++"  class,  it 
represents  the  specialized  nature  of  the  functionality  of  the  set  of  routines  grouped 
into  the  identified  objects.  On  the  other  hand,  the  "C++"  version  of  the  program 
merged  all  the  routines  together  in  a  class. 

The  main  contributions  of  this  research  are  a  new  program  modularization  ap- 
proach based  on  type  binding  information,  an  efficient  approach  of  object  identifica- 
tion in  non  object-oriented  languages,  and  a  new  set  of  complexity  metrics  factors  to 
evaluate  the  results  of  the  approach.  The  evaluation  studies  in  this  dissertation  pro- 
vided the  guidelines  for  an  experiment  to  further  evaluate  these  object  identification 
methods. 

Future  improvements  to  the  current  prototype  interactive  software  tool,  which 
implements  these  object  identification  methods  for  the  "C"  language,  will  allow  us 
to  experiment  with  finding  objects  in  samples  of  industrial-size  code.  Our  intention 
is  to  collect  data  on  the  user's  decisions  and  his  reasons  for  making  them  so  that 
improved  heuristics  or  knowledge-based  methods  can  be  developed  to  improve  the 
process.  Further  evaluation  of  the  object  finder  algorithms  consists  of  an  statistical 
experimentation  using  production-size  programs  such  as  the  GNU  "C"  compiler  gcc. 
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In  addition,  further  study  is  needed  to  determine  additional  characteristics  that 
could  be  used  to  identify  the  object-like  features  of  conventional  programming  lan- 
guages based  on  object-oriented  design  principles  [5].  These  principles  include  classifi- 
cation of  the  objects  in  the  application,  organization  of  the  objects,  and  identification 
of  operations  on  the  objects.  The  knowledge  required  to  abstract  the  object-like  fea- 
tures of  programs  includes  semantic  knowledge  about  the  program;  for  example,  the 
fact  that  a  variable  is  used  as  a  "counter"  of  iterations  on  a  given  data  structure  is 
determined  from  a  semantic  analysis.  One  solution  is  collecting  semantic  knowledge 
about  object-oriented  programs  [45]  as  an  initial  knowledge  base  about  the  program. 
Moreover,  other  knowledge  useful  in  identifying  object-like  characteristics  consists  of 
a  detailed  analysis  of  the  internals  of  routines  in  the  system  such  as  data  flow  analysis 
of  the  routine.  The  future  study  will  also  extend  the  object  finder  to  consider  the 
inheritance  characteristics  of  objects  when  identifying  objects.  A  possible  approach 
to  reveal  the  inheritance  characteristics  between  identified  objects  is  to  record  the 
existence  of  relationships  (defined  by  common  routines)  among  objects.  The  common 
routines  indicate  that  a  common  functionality  is  shared  by  the  sharing  objects.  This 
commonality  could  be  represented  as  a  "generalized  object"  with  each  of  the  sharing 
objects  containing  routines  that  specialize  the  generalized  object. 

In  the  future  research,  we  propose  to  extend  the  evaluation  criteria  of  the  object 
finder  to  other  metrics  that  measure  factors  such  as  the  degree  of  modularization  [7] 
and  the  "object-orientedness"  of  the  abstracted  design.  The  latter  factors  would  be 
based  on  object-oriented  concepts  and  principles.  The  complexity  metrics  factors 
defined  in  this  dissertation  measured  several  design  qualities  of  objects  based  on 
their  cohesion  and  coupling,  including  information  hiding,  encapsulation  (state  and 
operations),  and  abstraction  (abstract  class).  Furthermore,  two  other  design  qualities 
of  the  identified  objects  are  inheritance  and  polymorphism  which  could  be  used  to 
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measure  their  re-usability.  Future  research  will  define  the  primitive  metrics  factors 
which  measure  these  design  qualities. 

An  efficient  approach  of  object  identification  in  non  object-oriented  languages 
has  been  presented.  The  proposed  Globals-based  Object  Finder  algorithm  has  poly- 
nomial time  complexity  in  terms  of  the  program  length  as  measured  by  the  number 
of  global  variables  and  the  number  of  routines  in  the  program.  Since  the  number  of 
global  variables  is  usually  small  compared  to  the  number  of  routines  in  the  program, 
the  method  is  practical  in  most  software  applications.  An  improvement  to  the  per- 
formance of  Algorithm  1  was  outlined  in  Chapter  4  and  the  graph  construction  step 
which  results  in  almost  linear  time  complexity  in  terms  of  the  number  of  routines. 
The  proposed  Types-based  Object  Finder  algorithm  has  linear  complexity  in  terms 
of  the  number  of  routines  and  the  number  of  (abstract  data)  types. 

Module-level  understanding  of  the  system  structure  is  mainly  needed  when  a 
maintainer  is  coming  to  grips  with  a  system  for  the  first  time.  This  understanding  will 
allow  him  to  sort  out  the  components  of  a  system  and  perceive  the  overall  architecture 
of  the  system.  In  addition,  he  may  build  a  framework  to  assists  him  in  making  sense 
of  the  more  detailed  information  he  will  acquire  during  the  maintenance  activity. 
Thus,  the  need  for  this  understanding  is  clear.  The  object  finder  methods  assists 
the  programmer  in  attaining  some  of  this  high  level  understanding.  We  have  proved 
that  our  methods  are  at  least  as  good  as  current  practices  for  system  modularization 
and  high  level  design  knowledge  abstraction.  They  provide  extra  help  in  considering 
type  information  to  provide  this  knowledge.  Even  though,  this  is  of  great  help, 
the  job  is  not  complete  yet.  We  foresee  more  complete  techniques  of  module-level 
program  understanding  using  our  methods  as  well  as  object-oriented  principles  for 
this  endeavor. 


APPENDIX  A 
OBJECT  FINDER  PROTOTYPE  USER'S  MANUAL 


A.l     Introduction 

This  is  a  user's  manual  for  a  GNU  Emacs-based  [1]  object  finder  prototype 
developed  at  the  Software  Engineering  Research  Center,  Department  of  Computer 
and  Information  Sciences,  University  of  Florida.  It  is  currently  running  on  a  Unix 
BSD  4.3  environment  with  GNU  Emacs  18.55  installed.  We  expect  no  problem  to 
install  it  to  other  Unix  systems  as  long  as  a  GNU  Emacs  version  18  is  installed. 

This  prototype  is  a  design  recovery  tool.  The  goal  of  the  object  finder  is  to  aid 
an  analyst  or  maintainer  in  recovering  certain  kinds  of  design  principles  and  system 
structures  from  source  code.  The  object  finder  looks  for  clusters  of  data,  structures 
and  routines,  that  are  analogous  to  the  objects  and  object  classes  of  object-oriented 
programming. 

The  programming  language  supported  is  ANSI  "C".  "Emacs-based"  means  that 
all  Emacs  features,  such  as  editing  several  files  simultaneously,  opening  multiple 
windows  on  the  same  document,  and  undoing  mistakes,  ...,  etc,  are  still  available.  It 
should  also  be  noted  that  the  prototype  assumes  and  expects  syntactically  correct 
programs  before  analysis  may  begin. 

Section  A. 2  of  this  manual  will  cover  the  necessary  setup  and  operational  pro- 
cedures. Also  included  are  definitions  of  the  system  buffers  and  system  files  in  Sec- 
tion A. 3. 

A.2     Operation 

In  this  section,  we  explain  how  to  operate  the  system  in  detail. 
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A.2.1     Basic  Setup  Commands 
1.  Invoke  GNU  Emacs: 


emacs 


2.  Initialize  this  prototype: 
M-x  erds 

A. 2.2     Object  Finder  Analysis 

A  user  now  may  generate  a  file  describing  the  system  to  be  analyzed.  Currently, 
the  prototype  analyzes  a  "C"  program  that  consists  of  a  collection  of  source  files  and 
a  mkf  ile  file  which  describes  how  the  program  is  constructed.  The  created  files  must 
be  saved  before  invoking  this  prototype.  This  prototype  is  invoked  by  using: 
ESC  ESC  ESC  o 

The  system  will  prompt  with  From:.    Type  the  file  name,  e.g.    mkf  ile.    An 
Emacs  buffer  is  created  for  each  data  file. 

Menu-driven  commands  are  available  to  perform  the  analysis. 

A  menu  will  be  displayed  in  the  minibuffer  (at  the  bottom  of  this  screen)  as 
follows: 
T:Top-down  analysis   B:Bottom-up   analysis? 

The  meaning  related  to  each  option  follows: 

•  T  or  t:  performs  top-down  analysis. 

•  B  or  b:  performs  bottom-up  analysis. 
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The  first  step  of  the  bottom-up  analysis  is  to  propose  a  starting  point  -  one 
or  more  routines,  types,  or  global  variables  to  be  part  of  an  initial  candidate  object. 
For  this,  the  system  asks  for  the  unique  ID  of  the  global  variables,  types,  or  routines 
until  done  is  entered. 

A.2.3     Display  Analysis  Results  and  Identified  Objects 

Menu-driven  commands  are  available  to  display  the  results  of  the  globals-based 
and  types-based  analysis.  The  user  defines  several  settings  to  control  the  appearance 
of  the  output  as  follows: 

•  Show  graph  G: Current  global  name  length    [CURRENT-VALUE]:    , 

set   _global-name-length  to  value: 

Determines  the  width  in  characters  of  the  global  name  column  in  graph  G. 

•  Show  graph  G: Current   global  name   length    [CURRENT-VALUE]:    , 

set   _global-name-length  to  value: 

Determines  the  width  in  characters  of  the  type  name  column  in  matrix  R. 

•  Show  graph  G: Current  column  width  is    [CURRENT- VALUE] :    , 

set   .column-width  to  value: 

Determines  the  width  in  characters  of  a  column  in  graph  G  or  matrix  R. 

In  any  of  the  choices  above,  just  press  return  to  choose  the  default  CURRENT- VALUE. 

After  the  analysis  results  are  displayed,  the  user  is  confirmed  to  display  the 
identified  candidate  objects  from  the  current  candidate  object: 
Show  candidate   objects? (y/n) 

Answer  y  to  display  candidate  objects;  otherwise,  answer  n. 

A.2.3.1     Top-down  analysis  and  display 

The  following  settings  are  specified  by  the  user  during  the  top-down  analysis: 
Globals   analysis   for  object   OBJECT-NAME   ? 
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Answer  y  if  globals-based  analysis  is  performed  on  the  program;  otherwise,  answer  n. 

Types   analysis  for   object   OBJECT-NAME   ? 

Answer  y  if  types-based  analysis  is  performed  on  the  program;  otherwise,  answer  n. 

Both  settings  are  displayed  back  to  the  user  and  the  system  waits  for  confirma- 
tion: 

Confirm  top  down  settings  above  for  object   OBJECT-NAME  ? 
Answer  y  if  the  user  agrees  with  the  settings;  otherwise,  answer  n. 

During  types-based  analysis  the  user  is  asked  whether  to  consider  the  primitive 
types  in  the  analysis. 

Type  analysis  grouping  with  basic  types? 
Answer  y  if  consider  primitive  types;  otherwise,  answer  n  to  ignore  primitive  types. 

The  user  may  further  refine  the  analysis  results,  in  a  top-down  fashion,  when 
the  top-down  analysis  menu  is  displayed: 
Continue(c) ; disable  globals(d) ,routines(f ) ,types(y) ;exit(e) ;quit(q) ; 

xobject(x)? 

The  meaning  related  to  each  option  follows: 

•  c:  Continues  top-down  analysis  and  invokes  the  object  selection  menu. 

•  d:  Disables  some  global  variables  from  the  analysis,  and  prompt  the  user  for 
the  names  (unique  ID)  of  the  global  variables  until  done  is  entered. 

•  f :  Disables  some  routines  from  the  analysis,  and  prompt  the  user  for  the  names 
(unique  ID)  of  the  routines  until  done  is  entered. 

•  y:  Disables  some  types  from  the  analysis,  and  prompt  the  user  for  the  names 
(unique  ID)  of  the  types  until  done  is  entered. 


•  e:  Exits  object  finder  saving  the  results  in  files. 
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•  q:  Quits  object  finder  without  saving  the  results  in  files. 

•  x:  Invokes  the  X  Window-based  graphics  user  interface  xobject  [31]. 

To  continue  the  top-down  analysis,  the  object  selection  choice  (after  c  has  been 
typed)  will  prompt  a  menu  line  as  follows: 

Choose  object  from  subobjects(l) ,any  object (t) ; repeat   object(r); 

read  object (o)? 

The  meaning  related  to  each  option  follows: 

•  1:   Chooses  the  candidate  object  from  the  identified  candidate  objects  of  the 
current  candidate  object. 

•  t:  Chooses  the  candidate  object  from  all  the  identified  candidate  objects  in  the 
complete  system  (a  tree). 

•  r:  Repeats  analysis  of  the  current  candidate  object. 

•  o:  Read  the  candidate  objects  of  the  current  candidate  object  from  files.  Usu- 
ally, after  xobject  has  modified  and  saved  these  candidate  objects  into  files. 

For  choices  1  and  t  above,  specify  the  chosen  candidate  object  OBJECT-NAME 
followed  by  any  file  extension  in  Section  A. 3.1. 

A.2.3.2     Bottom-up  analysis  and  display 

The  following  bottom-up  settings  are  set  by  the  user: 
Data-routine   (dr) ,   routine-data   (rd) ,   type-routine    (tr) , 
routine-type    (rt)? 
The  meaning  related  to  each  setting  follows: 
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•  dr  or  DR:  adds  to  the  current  object  the  set  of  all  routines  that  use  global 
variables  in  the  object. 

•  rd  or  RD:  adds  to  the  current  object  the  set  of  all  global  variables  used  by 
routines  in  the  object. 

•  tr  or  TR:  adds  to  the  current  object  the  set  of  all  routines  that  use  types  in  the 
object  for  formal  parameters  or  return  values. 

•  rt  or  RT:  adds  to  the  current  object  the  set  of  all  types  used  as  formal  parameters 
or  return  values  by  routines  in  the  object. 

Then,  the  following  settings  are  set  by  the  user  during  bottom-up  analysis: 
Globals  analysis  for  object   OBJECT-NAME  ? 

Answer  y  if  globals-based  analysis  is  performed  on  the  program;  otherwise,  answer  n. 
Types  analysis  for  object   OBJECT-NAME  ? 
Answer  y  if  types-based  analysis  is  performed  on  the  program;  otherwise,  answer  n. 

All  settings  are  displayed  back  to  the  user  and  the  system  waits  for  confirmation: 
Confirm  bottom-up  settings   above  for  object   object-name  ? 
Answer  y  if  the  user  agrees;  otherwise,  answer  n. 

During  types-based  analysis  the  user  is  asked  whether  to  consider  the  primitive 
types  in  the  analysis. 

Type  analysis  grouping  with  basic  types? 
Answer  y  if  consider  primitive  types;  otherwise,  answer  n  to  ignore  primitive  types. 

The  user  may  further  refine  the  analysis,  in  a  bottom-up  fashion,  when  the 
bottom-up  analysis  menu  is  displayed: 
Repeat (r) ; disable  globals (d) , routines (f ) , types (y) ;exit(e) ;quit(q) ; 

XObject(x)? 
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The  meaning  related  to  each  option  follows: 

•  r:  Repeats  analysis  of  the  current  candidate  object. 

•  d:   Disables  some  global  variables  from  the  analysis,  and  prompt  the  user  for 
the  names  (unique  ID)  of  the  global  variables  until  done  is  entered. 

•  f :  Disables  some  routines  from  the  analysis,  and  prompt  the  user  for  the  names 
(unique  ID)  of  the  routines  until  done  is  entered. 

•  y:  Disables  some  types  from  the  analysis,  and  prompt  the  user  for  the  names 
(unique  ID)  of  the  types  until  done  is  entered. 

•  e:  Exits  object  finder  saving  the  results  in  files. 

•  q:  Quits  object  finder  without  saving  the  results  in  files. 

•  x:  Invokes  the  X  Window-based  graphics  user  interface  xobject.  This  feature 
is  currently  disabled  during  bottom-up  analysis. 

A  more  extensive  explanation  on  the  GNU  Emacs  commands  is  given  in  the 
GNU  Emacs  Manual  [40]. 

A. 3     Buffer  Structure  and  Files 

A .3.1     System  Buffer  Structure 

1.  OB JECT-NAME .  src  —  the  akfile  file. 

2.  OBJECT-NAME .  stb  —  the  symbol  table  (internal  representation)  of  the  candi- 
date object. 

3.  OB  JECT-NAME.  xrf  —  the  acx  cross-reference  information  for  the  candidate  ob- 
ject. 
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4.  OBJECT-NAME .  rou  —  the  routines  in  the  candidate  object. 

The  routines  are  identified  by  their  unique  ID  and  are  saved  one-per-line. 

5.  OB JECT-NAME .  typ  —  the  types  in  the  candidate  object. 

The  types  are  identified  by  their  unique  ID  and  are  saved  one-per-line. 

6.  OB  JECT-NAME.  gbl  —  the  global  variables  in  the  candidate  object. 

The  global  variables  are  identified  by  their  unique  ID  and  are  saved  one-per-line. 

7.  OB  JECT-NAME.  pv  —  set  of  routines  which  directly  use  variable  x:  P(x). 
Contains  lines  of  the  format: 

(x  P(x)) 

for  each  global  shared  variable  x,  x  is  the  unique  ID  of  variable  x. 

8.  OB  JECT-NAME.  pe  —  set  of  edges  between  global  variables  in  graph  G  from  the 
globals-based  analysis.  Contains  lines  of  the  format: 

(xl  x2    (P(xl)    intersection  P(x2))) 

where  xl  and  x2  are  unique  ID's  of  global  variables  and  P(xl)    intersection  P(x2) 
is  the  set  of  routines  which  directly  use  both  xl  and  x2. 

9.  OBJECT-NAME. co  —  set  of  candidate  objects  generated  by  the  globals-based 
and  types-based  analysis.  Contains  lines  of  the  format: 

"OBJECT-NAME" 

10.  OB  JECT-NAME.  to  —  types  ordering  in  the  types  hierarchy.  Contains  lines  of  the 
format: 
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(type  subtype-attribute) 

where  type  is  the  unique  ID  of  a  type  that  is  a  super-type  of  the  types  listed 
in  subtype-attribute. 

11.  OBJECT-NAME .  r  —  set  of  entries  in  the  matrix  R  from  the  types-based  analysis. 
Contains  lines  of  the  format: 

(routine  type  R(routine,type)) 

12.  OBJECT-NAME .  PE  —  the  graph  G  from  the  globals-based  analysis. 
This  buffer  is  displayed. 

13.  OBJECT-NAME.R  —  the  matrix  R  from  the  types-based  analysis. 
This  buffer  is  displayed. 

14.  OB JECT-NAME .  CO  —  set  of  candidate  objects  generated  by  the  analysis. 
This  buffer  is  displayed. 

15.  OB  JECT-NAME.  Xbj  —  The  xobject  process  output  buffer. 

16.  OBJECT-NAME.log  —  the  system  run  time  information.   The  information  con- 
sists of  the  current  time  in  string  format  of  an  event  in  a  function. 

A.3.2     System  Files 

1.  analysis.el  -  Perform  the  top-down  and  bottom- up  analysis  and  the  globals- 
based  analysis. 

2.  buffers. el  -  Buffer  control  library. 

3.  demo.el  -  Demonstration  functions. 
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4.  erds-interface.el  -  Dump  output  functions. 

5.  erds-manual.el  -  Menu-driven  support  library. 

6.  erds.el  -  ERDS  mode. 

7.  init.el  -  System  initialization. 

8.  layout. el  -  Emacs  window  layout  control  functions. 

9.  ofmd.el  -  Object  finder  main  driver. 

10.  parser2.el  -  Parser  for  the  cross-reference  information  from  acx. 

11.  plists.el  -  Property  list  functions. 

12.  sets. el  -  Set  control  library. 

13.  show. el  -  Analysis  results  and  candidate  objects  display  functions. 

14.  sys-log.el  -  System  run-time  log. 

15.  types. el  -  Perform  the  types-based  analysis. 

16.  xref2.el  -  Main  driver  for  the  parser  of  the  cross-reference  information  from 
acx. 

17.  .emacs  -  An  Emacs  initial  setup  file. 

18.  .emacs.  -  A  directory  to  hold  all  Emacs  initialization  files. 

19.  erds.texinfo  -  This  user's  manual. 
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